The UNIX command line
Fundamentals of Data Science
Why learn the command line?
- Access to remote servers is generally purely CLI
- Process automation often relies on CLI
- CLI is a quick and efficient way to work with files and directories
UNIX
UNIX is a generic term for a family of operating systems dating back to the 1960s. Many systems today are in the UNIX family. The most notable examples are:
- Linux (actually a whole family of Linux distributions), derived from an open source system created by Linus Torvalds
- MacOS
In addition, Microsoft now supports a system called WSL (Windows Subsystem for Linux) that allows you to work with Linux on a Windows machine.
The GNU project
The GNU project is a collection of open source tools written (primarily) for the UNIX ecosystem. The GNU project includes shells (bash), compilers (gcc), text editors (emacs), and many other resources. Most Linux systems are closely integrated with GNU tools.
The UNIX shell
The shell is a program that provides access to a range of tools for working with files and directories, and which can launch other programs that can do pretty much anything.
One can write programs for the shell to execute (these are called shell scripts) or one can use the shell interactively.
There are many shells available but the three you are most likely to encounter are:
- the Bourne shell (sh). This is the simplest shell program and mostly occurs in shell scripts. It is the “lowest common denominator” of UNIX shells.
- bash is the standard shell that is associated with the GNU toolkit.
- zsh has become a popular shell because of its flexibility and its many customization options.
On Windows, the Git Bash package provides a bash shell that runs in the Windows environment.
Most interactive commands are the same regardless of which shell you use. The programming languages of the different shells are similar, but definitely not the same.
Which shell am I running?
When you launch a terminal window on Linux or MacOS, the system starts a shell program in that window and you interact with that shell. On MacOS, the default shell is zsh. On most Linux installations, it is bash.
To see what’s happening on your computer, run the following command in a terminal window. Here, and later, the initial ‘$’ stands for the prompt you receive from the shell. Yours may be fancier. We’ll see later how to customize it.
$ echo $0
(Note: for a whole range of technical reasons this isn’t 100% guaranteed to work but it is almost certainly correct!)
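As a hedged sketch, here are two other ways to ask the same question. It is written as a script, so there is no $ prompt. The ps options used here are common on Linux and MacOS but are an assumption about your system, so the sketch falls back to "unknown" if they fail.

```shell
# $0 is the name the current shell (or script) was started with.
echo "started as: $0"

# $SHELL is the login shell from your account settings; it may differ
# from the shell you are running right now.
echo "login shell: ${SHELL:-unset}"

# $$ is the process id of the current shell; ask ps for its command name.
# If this ps does not understand the options, report "unknown" instead.
current_shell=$(ps -p $$ -o comm= 2>/dev/null || echo unknown)
echo "running: $current_shell"
```

The difference between $SHELL and the other two is one of the technical reasons mentioned above: $SHELL reports what you log in with, not necessarily what is running in this window.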
Redirection, Pipes, and Filters
Standard input and standard output
When a UNIX command runs, it typically has three I/O pathways associated with it:
- an input stream called the standard input.
- an output stream called the standard output.
- an output stream called the standard error output.
Programs which read from standard input and write to standard output are called filters.
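A minimal sketch of a filter and of the two separate output streams. The file names are made up for illustration, and the > and 2> operators used to capture each stream are the redirection operators covered in the Redirection part of these notes.

```shell
# A filter reads standard input and writes standard output.
# tr is a classic filter: here it upper-cases whatever it is given.
shout=$(echo "hello there" | tr 'a-z' 'A-Z')
echo "$shout"

# Standard output and standard error are separate streams.
# ls prints the name of an existing file to standard output and
# complains about a missing one on standard error.
workdir=$(mktemp -d)
touch "$workdir/real.txt"
ls "$workdir/real.txt" "$workdir/missing.txt" \
    > "$workdir/out.txt" 2> "$workdir/err.txt" || true

echo "stdout captured: $(cat "$workdir/out.txt")"
echo "stderr captured: $(cat "$workdir/err.txt")"
```

Because the error message travels on its own stream, it does not end up mixed into the data that the next program in a pipeline would read.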
Some key commands for this section:
- cat combines the files given as arguments and outputs them to standard output chained together. Without any files, it reads from standard input and outputs to standard output.
- wc counts lines, words, and characters from the file on the command line, or from standard input, and outputs to standard output. wc -c, wc -l, and wc -w give the individual numbers.
- cut extracts a field from the file on the command line (or from standard input) and outputs to standard output.
- echo sends its command line to standard output.
- sort sorts files or standard input, and outputs to standard output.
- grep looks for matching patterns in files or standard input, and outputs matching lines to standard output.
- uniq finds unique lines in a sorted file or standard input.
Redirection
One of the most powerful features of the shell is its ability to redirect standard input, standard output, and standard error to other files, and thereby construct pipelines.
Redirection of output
$ ls > files.txt # the > sign sends standard output to the given file.
$ cat files.txt # cat types the file to standard output
data
files.txt
model2.py
notes
$ ls -F # notice that files.txt has been created
data/ files.txt model2.py notes/
Ordinarily, redirecting output using > overwrites the target. But if you use >> you can append to the end of a file.
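A small sketch of the difference between > and >>, using a throwaway file in a temporary directory (the name log.txt is just for illustration).

```shell
workdir=$(mktemp -d)
log="$workdir/log.txt"    # hypothetical file name

echo "first"  > "$log"    # creates the file
echo "second" > "$log"    # overwrites it: "first" is gone
echo "third" >> "$log"    # appends after "second"

cat "$log"
# prints:
# second
# third
```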
Pipes
A pipe between commands is written with |. A pipe means the output from the first command should be sent as input to the second command.
First, let’s look at the wc command.
$ wc files.txt
4 4 31 files.txt # lines words characters in files.txt
$ wc -l files.txt # just the lines please
4 files.txt
Now we put wc into a pipeline with ls to count the number of files in our directory.
$ ls | wc -l # the output of ls (one file per line) goes to wc -l which counts lines
4
Here we combine the training and test files and count the number of lines.
$ ls data/*
test.csv training.csv
$ cat data/* | wc -l
151
The standard input
Commands like wc and cat either use files specified as arguments or, if there aren’t any, they read from standard input.
$ wc
Here, wc is reading this stuff (which comes from the terminal, i.e. standard input)
and is counting words, lines and so on.
I use CTRL-D to send an end of file to tell wc that I'm done.
^D
3 37 186
You can redirect standard input using <.
$ wc < files.txt
4 4 31
The output is the line/word/character counts, but there’s no file name because the data comes from standard input via the ‘<’ operator.
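The three ways of handing data to wc can be compared side by side. This sketch recreates a files.txt like the one above in a temporary directory, so it does not touch your own files.

```shell
workdir=$(mktemp -d)
printf 'data\nfiles.txt\nmodel2.py\nnotes\n' > "$workdir/files.txt"

# File named as an argument: wc knows the name and prints it.
with_name=$(wc -l "$workdir/files.txt")

# Same data through standard input: wc has no name to print.
from_redirect=$(wc -l < "$workdir/files.txt")
from_pipe=$(cat "$workdir/files.txt" | wc -l)

echo "$with_name"
echo "$from_redirect"
echo "$from_pipe"
```

The exact spacing of the numbers differs between systems, which is one reason the redirected or piped form is handy when you want just the count.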
Check-in: Explain what these commands do and why.
$ echo "Hello There"
$ echo "Hello There" > hello.txt
$ echo "Hello There" | wc
Two useful commands are sort and cut.
The cut command extracts fields from a delimited file. You can specify the field separator and the fields you want. The default delimiter for cut is the TAB character.
WARNING: cut (and sort) are not sophisticated about quoted fields that contain commas, unlike, say, pandas or the tidyverse. So you may not always get what you are expecting if you have quoted fields that contain commas.
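To make the warning concrete, here is a sketch with a single illustrative record laid out like the Titanic data used in these notes. The name field is quoted and contains a comma.

```shell
# One Titanic-style record (illustrative); the quoted name contains a comma.
row='1,0,3,"Braund, Mr. Owen Harris",male,22'

# cut splits on every comma, so the quoted name is broken in two and
# "field 5" is the second half of the name rather than the sex column.
naive=$(echo "$row" | cut -d, -f5)
echo "$naive"
# prints:  Mr. Owen Harris"
```

A CSV-aware reader such as pandas would treat the quoted name as one field and report male for the fifth column.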
$ cut -f4 -d, data/training.csv | head
Region
Anvers
Anvers
Anvers
Anvers
Anvers
Anvers
Anvers
Anvers
Anvers
$ cut -f1-4 -d, data/test.csv | tail -3
PAL0910,66,Chinstrap penguin (Pygoscelis antarctica),Anvers
PAL0910,67,Chinstrap penguin (Pygoscelis antarctica),Anvers
PAL0910,68,Chinstrap penguin (Pygoscelis antarctica),Anvers
$ ls -l | cut -f1 -d' ' # here we use space as a delimiter.
$ cut -c1-10 data/training.csv | head -2 # first 10 characters.
studyName,
PAL0708,1,
The sort command sorts (not surprisingly). You can specify the file on the command line or use sort as a filter. Ordinarily it sorts on the entire line.
$ sort data/titanic_train.csv > data/titanic_train_sorted.csv
$ ls data
test.csv titanic_test.csv titanic_train.csv titanic_train_sorted.csv training.csv
$ head -3 data/titanic_train_sorted.csv
100,0,2,"Kantor, Mr. Sinai",male,34,1,0,244367,26,,S
101,0,3,"Petranec, Miss. Matilda",female,28,0,0,349245,7.8958,,S
10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14,1,0,237736,30.0708,,C
You can specify fields (sort key) with -k, numerical sort with -n, reverse with -r, case folding with -f, and a field separator with -t. The quoted names throw things off here!
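A minimal sketch of sort keys on a made-up three-column file (id, name, score), so the quoting problem does not get in the way. Note that -k2 on its own means "from field 2 to the end of the line", while -k2,2 restricts the key to field 2.

```shell
workdir=$(mktemp -d)
# Hypothetical three-column file: id,name,score
printf '3,carol,72\n1,bob,95\n2,alice,88\n' > "$workdir/scores.csv"

# -t, sets the field separator; -k2,2 sorts on field 2 only.
by_name=$(sort -t, -k2,2 "$workdir/scores.csv")

# -k3,3nr sorts on field 3, numerically (n) and in reverse (r).
by_score=$(sort -t, -k3,3nr "$workdir/scores.csv")

echo "$by_name"     # alice, bob, carol
echo "$by_score"    # 95, 88, 72
```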
$ sort -k4 -t, data/titanic_train.csv | head -3 # field four, sep=,
846,0,3,"Abbing, Mr. Anthony",male,42,0,0,C.A. 5547,7.55,,S
747,0,3,"Abbott, Mr. Rossmore Edward",male,16,1,1,C.A. 2673,20.25,,S
280,1,3,"Abbott, Mrs. Stanton (Rosa Hunt)",female,35,1,1,C.A. 2673,20.25,,S
$ sort -k4 -r -t, data/titanic_train.csv | head -2
423,0,3,"Zimmerman, Mr. Leo",male,29,0,0,315082,7.875,,S
241,0,3,"Zabour, Miss. Thamine",female,,1,0,2665,14.4542,,C
Sorting is normally done “alphabetically” in which case the order might not be what you expect for numbers. The -n flag forces numeric values to be used.
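The same effect on a tiny made-up file, as a sketch. LC_ALL=C is set for the text sort only to pin down the character order, since the "alphabetical" order otherwise depends on your locale settings.

```shell
workdir=$(mktemp -d)
printf 'id\n100\n9\n10\n' > "$workdir/ids.txt"

# Plain sort compares text character by character, so "10" < "100" < "9".
text_order=$(LC_ALL=C sort "$workdir/ids.txt" | paste -s -d ' ' -)

# sort -n compares numeric values; a non-numeric line such as the
# header counts as 0 and therefore comes first.
num_order=$(sort -n "$workdir/ids.txt" | paste -s -d ' ' -)

echo "$text_order"   # 10 100 9 id
echo "$num_order"    # id 9 10 100
```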
$ sort data/titanic_train.csv | cut -f1 -d, | head -5 # alphabetical
100
101
10
102
103
$ sort -n data/titanic_train.csv | cut -f1 -d, | head -5 # numerical
PassengerId
1
2
3
4
Searching
The grep command searches for matches in its input and outputs any that it finds. There are several variants of grep and there are many, many options to the command.
The basics:
$ grep 'William' data/titanic_train.csv | head -5
5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S
13,0,3,"Saundercock, Mr. William Henry",male,20,0,0,A/5. 2151,8.05,,S
18,1,2,"Williams, Mr. Charles Eugene",male,,0,0,244373,13,,S
24,1,1,"Sloper, Mr. William Thompson",male,28,0,0,113788,35.5,A6,S
32,1,1,"Spencer, Mrs. William Augustus (Marie Eugenie)",female,,1,0,PC 17569,146.5208,B78,C
$ grep '^8' data/titanic_train.csv | head -5
8,0,3,"Palsson, Master. Gosta Leonard",male,2,3,1,349909,21.075,,S
80,1,3,"Dowdell, Miss. Elizabeth",female,30,0,0,364516,12.475,,S
81,0,3,"Waelens, Mr. Achille",male,22,0,0,345767,9,,S
82,1,3,"Sheerlinck, Mr. Jan Baptist",male,29,0,0,345779,9.5,,S
83,1,3,"McDermott, Miss. Brigdet Delia",female,,0,0,330932,7.7875,,Q
$ grep '\bWilliam\b' data/titanic_train.csv | tail -5
803,1,1,"Carter, Master. William Thornton II",male,11,1,2,113760,120,B96 B98,S
811,0,3,"Alexander, Mr. William",male,26,0,0,3474,7.8875,,S
865,0,2,"Gill, Mr. John William",male,24,0,0,233866,13,,S
881,1,2,"Shelley, Mrs. William (Imanita Parrish Hall)",female,25,0,1,230433,26,,S
886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39,0,5,382652,29.125,,Q
$ grep '\bWilliam\b' data/titanic_train.csv | wc -l
You can do a lot of useful stuff by combining grep with other tools.
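One thing to watch when combining grep with other tools is that a plain pattern matches anywhere in the line. The sketch below uses a hypothetical three-line file in a simplified Titanic-like layout. The -c option makes grep print a count of matching lines, and -w (whole words only) is used here as a widely supported alternative to \b.

```shell
workdir=$(mktemp -d)
# Hypothetical mini file, loosely modeled on the Titanic data.
cat > "$workdir/people.csv" <<'EOF'
1,"Allen, Mr. William Henry",male
2,"Williams, Mr. Charles Eugene",male
3,"Heikkinen, Miss. Laina",female
EOF

# A plain pattern matches anywhere, so "male" also matches "female".
loose=$(grep -c 'male' "$workdir/people.csv")

# Requiring a comma before and end-of-line after pins down the field.
strict=$(grep -c ',male$' "$workdir/people.csv")

# -w matches whole words only, so "Williams" is not counted.
whole_word=$(grep -c -w 'William' "$workdir/people.csv")

echo "$loose $strict $whole_word"   # 3 2 1
```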
$ grep female data/titanic_train.csv > data/females_train.csv
$ head -5 data/females_train.csv
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27,0,2,347742,11.1333,,S
10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14,1,0,237736,30.0708,,C
UNIX tools aren’t as powerful as the csv tools in, for example, pandas, but you can still be clever. Suppose we want to get at the names in the titanic file. They have embedded commas but are set off with quotation marks, so we can use the quotation mark itself as the delimiter.
$ cut -f2 -d\" data/titanic_train.csv | head -5
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
Braund, Mr. Owen Harris
Cumings, Mrs. John Bradley (Florence Briggs Thayer)
Heikkinen, Miss. Laina
Futrelle, Mrs. Jacques Heath (Lily May Peel)
$ cut -d\" -f3 data/titanic_train.csv | cut -f2 -d, | head -3
Survived
male
female
The uniq command can be used to see the different elements in the field.
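uniq only compares neighbouring lines, which is why it is almost always preceded by sort. A sketch on a made-up five-line file:

```shell
workdir=$(mktemp -d)
printf 'male\nfemale\nmale\nmale\nfemale\n' > "$workdir/sex.txt"

# uniq only collapses *adjacent* duplicates, so unsorted input
# still yields repeated values (4 lines here instead of 2).
unsorted=$(uniq "$workdir/sex.txt" | wc -l)

# Sorting first brings equal lines together; uniq -c then counts them,
# and a final sort -rn puts the most frequent value first.
counts=$(sort "$workdir/sex.txt" | uniq -c | sort -rn)
echo "$counts"
```

The sort | uniq -c | sort -rn combination is a common way to get a quick frequency table.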
$ cut -d\" -f3 data/titanic_train.csv | cut -f2 -d, | sort | uniq
female
male
Survived
$ cut -d\" -f3 data/titanic_train.csv | cut -f2 -d, | sort | uniq -c
53
282 female
556 male
1 Survived
Exercise
- Use UNIX tools to determine:
  - how many of each type of penguin there are in the penguins_training.csv and penguins_test.csv files.
  - how many males and females there are in each file.
- Combine the training and test files for the penguins into a single file (penguins.csv). Then make a file male_penguins.csv containing just the males.
Variables and Loops
Variables
You use names for variables (like x), but to refer to the value of a variable you use $x.
$ x="hello" # no spaces!
$ echo $x
hello
The various UNIX shells are programming languages and they have the full set of capabilities: variables, logical statements, loops….
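A short sketch of variables together with a logical statement. The variable names are made up for illustration. The syntax is the plain sh form, which bash and zsh both accept.

```shell
name="penguins"            # no spaces around =
file="${name}_test.csv"    # braces mark where the variable name ends
echo "$file"               # prints penguins_test.csv

# $(...) captures a command's standard output in a variable.
word_count=$(echo "one two three" | wc -w)

# A logical statement: [ ... ] tests a condition.
if [ "$word_count" -eq 3 ]; then
    verdict="three words"
else
    verdict="something else"
fi
echo "$verdict"            # prints three words
```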
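The syntax for loops differs somewhat between zsh and bash. The do ... done form works in both shells; zsh (with its default options) also accepts a shorter form without do and done.
$ for x in *.csv # bash (also works in zsh)
> do
> echo $x
> done
$ for x in *.csv # zsh short form
> echo $x
You can use loops to (for example) copy or rename a bunch of files.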
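As a sketch of the renaming case, this loop adds a .bak suffix to every .csv file in a throwaway directory. The file names are made up, and the loop uses the do ... done form.

```shell
workdir=$(mktemp -d)
touch "$workdir/a.csv" "$workdir/b.csv"

# Give every .csv file a .bak suffix. The quotes keep file names
# that contain spaces in one piece.
for f in "$workdir"/*.csv
do
    mv "$f" "$f.bak"
done

ls "$workdir"
# prints:
# a.csv.bak
# b.csv.bak
```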
$ mkdir bkup
$ for x in *.csv # zsh short form
> cp "$x" "bkup/$x"
$ mkdir bkup
$ for x in *.csv # bash
> do
> cp "$x" "bkup/$x"
> done
Scripts
A shell script is a file containing a sequence of shell commands. When an interactive shell starts, it executes a startup script stored in your home directory: .zshrc for zsh or .bashrc for bash. This is where you can put commands to customize your shell.
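A minimal sketch of writing and running a script. The script name count_lines.sh is hypothetical, and the script is created in a temporary directory so nothing in your own files is touched.

```shell
workdir=$(mktemp -d)
script="$workdir/count_lines.sh"   # hypothetical script name

# Write a two-line script. The first line names the interpreter;
# "$1" is the first argument given to the script.
cat > "$script" <<'EOF'
#!/bin/sh
wc -l < "$1"
EOF

chmod +x "$script"   # mark it executable so it could also be run directly

# Run the script by handing it to sh, with one argument.
printf 'a\nb\nc\n' > "$workdir/sample.txt"
result=$(sh "$script" "$workdir/sample.txt")
echo "$result"
```

After the chmod, running "$script" "$workdir/sample.txt" directly should work as well on most systems; handing the file to sh avoids depending on that.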
The environment
Every shell has an environment which is a bunch of variables that are used by programs. Elements of the environment are called environment variables. Sometimes you have to set an environment variable.
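Setting an environment variable is done with export. The sketch below contrasts it with a plain shell variable. The names greeting and COURSE_DATA_DIR, and the path, are made up for illustration.

```shell
# A plain assignment creates a shell variable, visible only to this shell.
greeting="hello"

# export puts a variable into the environment, so that programs
# started from this shell inherit it.
export COURSE_DATA_DIR="/tmp/course-data"   # hypothetical name and path

# sh -c starts a child shell; the single quotes stop the parent shell
# from expanding the variables before the child sees them.
seen_plain=$(sh -c 'echo "${greeting:-unset}"')
seen_exported=$(sh -c 'echo "$COURSE_DATA_DIR"')

echo "$seen_plain"      # unset
echo "$seen_exported"   # /tmp/course-data
```

To make such a setting permanent, the export line would typically go in the startup script (.zshrc or .bashrc).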
Some common and important shell environment variables are:
$ echo $HOME # home directory
$ echo $PATH # search path for commands
$ echo $USER # your user id
$ echo $PS1 # the shell prompt
$ env # print the entire environment
Other topics
- Using wget to get information from the web.
- Using ssh to make a remote connection.
- Using rsync to copy files across computers.
- Using tar or gzip to compress and uncompress archives of files.