The UNIX command line

Fundamentals of Data Science

Author

Jeremy Teitelbaum

Why learn the command line?

  1. Access to remote servers is generally purely CLI
  2. Process automation often relies on CLI
  3. CLI is a quick and efficient way to work with files and directories

UNIX

UNIX is a generic term for a family of operating systems dating back to the late 1960s. Many systems today are in the UNIX family. The most notable examples are

  1. Linux (actually a whole family of Linux OS’s) – derived from an open source system created by Linus Torvalds
  2. MacOS

In addition, Microsoft now supports a system called WSL (Windows Subsystem for Linux) that allows you to work with Linux on a Windows machine.

The GNU project

The GNU project is a collection of open source tools written (primarily) for the UNIX ecosystem. The GNU project includes shells (bash), compilers (gcc), text editors (emacs), and many other resources. Most Linux systems are closely integrated with GNU tools.

The UNIX shell

The shell is a program that provides access to a range of tools for working with files and directories, and which can launch other programs that can do pretty much anything.

One can write programs for the shell to execute (these are called shell scripts) or one can use the shell interactively.

There are many shells available but the three you are most likely to encounter are:

  • the Bourne shell (sh). This is the simplest shell program and mostly occurs in shell scripts. It is the “lowest common denominator” of UNIX shells.
  • bash (the “Bourne-again shell”). This is the standard shell associated with the GNU toolkit.
  • zsh. This has become a popular shell because of its flexibility and its many customization options.

On Windows, Git Bash (installed with Git for Windows) provides a bash shell that runs in the Windows environment.

Most interactive commands are the same regardless of which shell you use, but the scripting languages of the shells, while similar, are definitely not the same.

Which shell am I running?

When you launch a terminal window on Linux or MacOS, the system starts a shell program in that window and you interact with that shell. On MacOS, the default shell is zsh. On most Linux installations, it is bash.

To see what’s happening on your computer, run the following command in a terminal window. Here, and later, the initial ‘$’ stands for the prompt you receive from the shell. Yours may be fancier. We’ll see later how to customize it.

$ echo $0

(Note: for a whole range of technical reasons this isn’t 100% guaranteed to work but it is almost certainly correct!)
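Because $0 can mislead (inside a script it holds the script's name, for example), a couple of cross-checks can help. The sketch below assumes a ps command that accepts the -p and -o options, and falls back to "unknown" if it does not. Note that $SHELL names your login shell, which is not necessarily the shell running right now.

```shell
# Three ways to ask which shell is in use; they answer slightly different questions.
echo "$0"                   # name the current shell (or script) was started with
echo "${SHELL:-not set}"    # login shell recorded for your account

# $$ is the process ID of the current shell; ask ps for that process's name.
current=$(ps -p $$ -o comm= 2>/dev/null || echo unknown)
echo "$current"
```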

Redirection, Pipes, and Filters

Standard input and standard output

When a UNIX command runs, it typically has three I/O pathways associated with it:

  • an input stream called the standard input.
  • an output stream called the standard output.
  • an output stream called the standard error output.

Programs which read from standard input and write to standard output are called filters.

Some key commands for this section:

  1. cat concatenates the files given as arguments and writes the result to standard output. Without any files, it reads from standard input.
  2. wc counts lines, words, and characters in the files on the command line, or in standard input, and writes the counts to standard output. wc -l, wc -w, and wc -c give the individual numbers (strictly, -c counts bytes; -m counts characters).
  3. cut extracts a field from the file on the command line (or from standard input) and outputs to standard output.
  4. echo sends its command line to standard output.
  5. sort sorts files or standard input, outputs to standard output.
  6. grep looks for matching patterns in files or standard input, outputs matching lines to standard output.
  7. uniq finds unique lines in a sorted file or standard input.

Redirection

One of the most powerful features of the shell is its ability to redirect standard input, standard output, and standard error to other files, and thereby construct pipelines.
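Standard error can be redirected separately from standard output by writing 2> instead of >. Here is a minimal sketch; the file names (exists.txt, no_such_file, out.txt, err.txt) are made up for illustration, and everything happens in a scratch directory.

```shell
# Work in a scratch directory so nothing real is overwritten.
tmp=$(mktemp -d)
cd "$tmp"
echo "some data" > exists.txt

# ls writes the listing to standard output and its complaint about the
# missing file to standard error; each stream goes to its own file.
# (ls exits with an error status here, which "|| true" ignores.)
ls exists.txt no_such_file > out.txt 2> err.txt || true

cat out.txt    # the listing of exists.txt
cat err.txt    # the error message about no_such_file
```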

Redirection of output

$ ls > files.txt # the > sign sends standard output to the given file.
$ cat files.txt # cat types the file to standard output
data
files.txt
model2.py 
notes
$ ls -F # notice that files.txt has been created
data/ files.txt model2.py notes/

Ordinarily, redirecting output using > overwrites the target. But if you use >> you can append to the end of a file.
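The difference between > and >> is easy to see by counting lines after each step. This is a small sketch using a hypothetical file log.txt in a scratch directory.

```shell
tmp=$(mktemp -d)
cd "$tmp"

echo "first"  > log.txt              # creates (or overwrites) log.txt
echo "second" >> log.txt             # appends a second line
after_append=$(wc -l < log.txt)      # two lines at this point

echo "third"  > log.txt              # overwrites: only this line remains
after_overwrite=$(wc -l < log.txt)   # back to one line

echo "after >>: $after_append, after >: $after_overwrite"
```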

Pipes

A pipe between commands is written with |. A pipe means the output from the first command should be sent as input to the second command.

First, let’s look at the wc command.

$ wc files.txt
4 4 31 files.txt # lines words characters in files.txt
$ wc -l files.txt # just the lines please
4 files.txt

Now we put wc into a pipeline with ls to count the number of files in our directory.

$ ls | wc -l  # the output of ls (one file per line) goes to wc -l which counts lines 
4

Here we combine the training and test files and count the number of lines.

$ ls data/*
test.csv training.csv
$ cat data/* | wc -l 
151

The standard input

Commands like wc and cat either use files specified as arguments or, if there aren’t any, they read from standard input.

$ wc 
Here, wc is reading this stuff (which comes from the terminal, i.e. standard input)
and is counting words, lines and so on.
I use CTRL-D to send an end of file to tell wc that I'm done.
^D 
3 37 186

You can redirect standard input using <.

$ wc < files.txt
4 4 31

The output is the line/word/character counts, but there’s no file name because the data comes from standard input via the ‘<’ operator.
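The missing file name is handy in scripts, where you often want just the number. A minimal sketch, rebuilding the files.txt from the earlier example in a scratch directory:

```shell
tmp=$(mktemp -d)
cd "$tmp"
printf 'data\nfiles.txt\nmodel2.py\nnotes\n' > files.txt

wc -l files.txt          # count followed by the file name
wc -l < files.txt        # count only: wc never sees a file name

n=$(wc -l < files.txt)   # capture the bare count in a variable
echo "files.txt has $n lines"
```

Some versions of wc pad the number with leading spaces, so the exact spacing of the last line may vary.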

Check-in: Explain what these commands do and why.

$ echo "Hello There"
$ echo "Hello There" > hello.txt
$ echo "Hello There" | wc 

Two useful commands are sort and cut.

The cut command extracts fields from a delimited file. You can specify the field separator and the fields you want. The default delimiter for cut is the TAB character.

WARNING: cut (and sort) are not sophisticated about quoted fields that contain commas, unlike, say, pandas or the tidyverse. So you may not always get what you are expecting if you have quoted fields that contain commas.
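To see the problem concretely, here is a small sketch with a hypothetical file people.csv whose name column contains a quoted comma, in the style of the Titanic data used below.

```shell
tmp=$(mktemp -d)
cd "$tmp"

# Illustrative records only; the name field contains a comma inside quotes.
cat > people.csv <<'EOF'
id,name,sex
1,"Braund, Mr. Owen Harris",male
2,"Heikkinen, Miss. Laina",female
EOF

# cut splits on every comma, including the one inside the quotes,
# so "field 3" is the tail of the name rather than the sex column.
third=$(cut -d, -f3 people.csv)
echo "$third"
```

Only the header row gives the expected answer (sex); the data rows return the second half of each name.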

$ cut -f4 -d, data/training.csv | head
Region
Anvers
Anvers
Anvers
Anvers
Anvers
Anvers
Anvers
Anvers
Anvers
$ cut -f1-4 -d, data/test.csv | tail -3
PAL0910,66,Chinstrap penguin (Pygoscelis antarctica),Anvers
PAL0910,67,Chinstrap penguin (Pygoscelis antarctica),Anvers
PAL0910,68,Chinstrap penguin (Pygoscelis antarctica),Anvers
$ ls -l  | cut  -f1 -d' ' # here we use space as a delimiter.
$ cut -c1-10 data/training.csv | head -2 # first 10 characters.
studyName,
PAL0708,1,

The sort command sorts (not surprisingly). You can specify the file on the command line or use sort as a filter. Ordinarily it sorts on the entire line.

$ sort data/titanic_train.csv > data/titanic_train_sorted.csv
$ ls data
test.csv  titanic_test.csv  titanic_train.csv  titanic_train_sorted.csv  training.csv
$ head -3 data/titanic_train_sorted.csv
100,0,2,"Kantor, Mr. Sinai",male,34,1,0,244367,26,,S
101,0,3,"Petranec, Miss. Matilda",female,28,0,0,349245,7.8958,,S
10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14,1,0,237736,30.0708,,C

You can specify fields (sort key) with -k, numerical sort with -n, reverse with -r, case folding with -f, and a field separator with -t. The quoted names throw things off here!

$ sort -k4 -t, data/titanic_train.csv | head -3 # field four, sep=,
846,0,3,"Abbing, Mr. Anthony",male,42,0,0,C.A. 5547,7.55,,S
747,0,3,"Abbott, Mr. Rossmore Edward",male,16,1,1,C.A. 2673,20.25,,S
280,1,3,"Abbott, Mrs. Stanton (Rosa Hunt)",female,35,1,1,C.A. 2673,20.25,,S
$ sort -k4 -r -t, data/titanic_train.csv | head -2
423,0,3,"Zimmerman, Mr. Leo",male,29,0,0,315082,7.875,,S
241,0,3,"Zabour, Miss. Thamine",female,,1,0,2665,14.4542,,C
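One detail worth knowing: -k4 means "from field 4 to the end of the line", while -k4,4 restricts the key to field 4 alone, and a trailing n or r applies numeric or reverse ordering to just that key. The sketch below uses a made-up two-column file, scores.csv, to show a numeric sort on the second field.

```shell
tmp=$(mktemp -d)
cd "$tmp"

# Hypothetical data: name,score
cat > scores.csv <<'EOF'
carol,9
alice,100
bob,25
EOF

# -t, sets the separator; -k2,2 restricts the key to field 2 alone;
# the trailing n compares that field as a number.
sorted=$(sort -t, -k2,2n scores.csv)
echo "$sorted"
```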

Sorting is normally done “alphabetically” in which case the order might not be what you expect for numbers. The -n flag forces numeric values to be used.

$ sort  data/titanic_train.csv | cut -f1 -d, | head -5 # alphabetical
100
101
10
102
103
$ sort -n data/titanic_train.csv | cut -f1 -d, | head -5 # numerical
PassengerId
1
2
3
4
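The header line lands at the top here only because sort -n treats non-numeric text as zero. A more deliberate approach is to set the header aside and sort just the data rows. This sketch uses a small hypothetical file, ids.csv, and restricts the key to the first field so the commas later in the line cannot affect the comparison.

```shell
tmp=$(mktemp -d)
cd "$tmp"

cat > ids.csv <<'EOF'
PassengerId,Survived
10,1
2,1
1,0
EOF

# Print the header as-is, then sort only the data rows numerically on field 1.
{
  head -n 1 ids.csv
  tail -n +2 ids.csv | sort -t, -k1,1n
} > ids_sorted.csv

cat ids_sorted.csv
```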

Searching

The grep command searches for matches in its input and outputs any that it finds. There are several variants of grep and there are many, many options to the command.

The basics:

$ grep 'William' data/titanic_train.csv | head -5  
5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S
13,0,3,"Saundercock, Mr. William Henry",male,20,0,0,A/5. 2151,8.05,,S
18,1,2,"Williams, Mr. Charles Eugene",male,,0,0,244373,13,,S
24,1,1,"Sloper, Mr. William Thompson",male,28,0,0,113788,35.5,A6,S
32,1,1,"Spencer, Mrs. William Augustus (Marie Eugenie)",female,,1,0,PC 17569,146.5208,B78,C
$ grep '^8' data/titanic_train.csv | head -5
8,0,3,"Palsson, Master. Gosta Leonard",male,2,3,1,349909,21.075,,S
80,1,3,"Dowdell, Miss. Elizabeth",female,30,0,0,364516,12.475,,S
81,0,3,"Waelens, Mr. Achille",male,22,0,0,345767,9,,S
82,1,3,"Sheerlinck, Mr. Jan Baptist",male,29,0,0,345779,9.5,,S
83,1,3,"McDermott, Miss. Brigdet Delia",female,,0,0,330932,7.7875,,Q
$ grep '\bWilliam\b' data/titanic_train.csv | tail -5
803,1,1,"Carter, Master. William Thornton II",male,11,1,2,113760,120,B96 B98,S
811,0,3,"Alexander, Mr. William",male,26,0,0,3474,7.8875,,S
865,0,2,"Gill, Mr. John William",male,24,0,0,233866,13,,S
881,1,2,"Shelley, Mrs. William (Imanita Parrish Hall)",female,25,0,1,230433,26,,S
886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39,0,5,382652,29.125,,Q
$ grep '\bWilliam\b' data/titanic_train.csv | wc -l
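A few of grep's many options cover most everyday needs. The sketch below tries them on a three-line hypothetical file, names.txt; the -w option is assumed to be available, as it is in both GNU and BSD grep.

```shell
tmp=$(mktemp -d)
cd "$tmp"

cat > names.txt <<'EOF'
Allen, Mr. William Henry
Williams, Mr. Charles Eugene
Dowdell, Miss. Elizabeth
EOF

substring=$(grep -c 'William' names.txt)    # -c counts matching lines; Williams matches too
whole=$(grep -cw 'William' names.txt)       # -w matches whole words only
anycase=$(grep -ci 'william' names.txt)     # -i ignores case
others=$(grep -v 'William' names.txt)       # -v keeps the lines that do NOT match

echo "$substring $whole $anycase"
echo "$others"
```

Using grep -c does the same job as piping grep into wc -l.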

You can do a lot of useful stuff by combining grep with other tools.

$ grep female data/titanic_train.csv > data/females_train.csv
$ head data/females_train.csv
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27,0,2,347742,11.1333,,S
10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14,1,0,237736,30.0708,,C

UNIX tools aren’t as powerful as the csv tools in, for example, pandas, but you can still be clever. Suppose we want to get at the names in the titanic file. They have embedded commas but are set off with quotation marks, so we can use the quotation mark itself as the delimiter.

$ cut -f2 -d\" data/titanic_train.csv | head -5
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
Braund, Mr. Owen Harris
Cumings, Mrs. John Bradley (Florence Briggs Thayer)
Heikkinen, Miss. Laina
Futrelle, Mrs. Jacques Heath (Lily May Peel)
$ cut -d\" -f3 data/titanic_train.csv | cut -f2 -d, | head -3
Survived
male
female

The uniq command can be used to see the different elements in the field.

$ cut -d\" -f3 data/titanic_train.csv | cut -f2 -d, | sort | uniq 

female
male
Survived
$ cut -d\" -f3 data/titanic_train.csv | cut -f2 -d, | sort | uniq -c
 53 
 282 female
 556 male
 1 Survived
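A common extension of this idiom is to sort the counts themselves, giving a frequency table with the most common value first. Here is a sketch on a tiny made-up file, sex.txt:

```shell
tmp=$(mktemp -d)
cd "$tmp"
printf 'male\nfemale\nmale\nmale\nfemale\n' > sex.txt

# uniq only merges adjacent duplicates, hence the first sort;
# the second sort orders the counts numerically, largest first.
table=$(sort sex.txt | uniq -c | sort -rn)
echo "$table"
```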

Exercise

  1. Use UNIX tools to determine:
    • how many of each type of penguin there are in the penguins_training.csv and penguins_test.csv files.
    • how many males and females there are in each file.
  2. Combine the training and test files for the penguins into a single file (penguins.csv). Then make a file male_penguins.csv containing just the males.

Variables and Loops

Variables

You use names for variables (like x), but to refer to the value of a variable you put a $ in front of the name, as in $x.

$ x="hello" # no spaces!
$ echo $x
hello
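A few more variable idioms come up constantly: quoting, braces, and capturing the output of a command. The following sketch should behave the same in sh, bash, and zsh; the variable names are arbitrary.

```shell
x="hello"                 # assignment: no spaces around the = sign
y="$x world"              # double quotes still expand $x
z='$x world'              # single quotes keep the text literal
n=$(echo "$y" | wc -w)    # $( ... ) captures the output of a command

echo "$y"
echo "$z"
echo "${x}_backup"        # braces mark where the variable name ends
echo "y contains $n words"
```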

The various UNIX shells are programming languages and they have the full set of capabilities: variables, logical statements, loops….

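The basic for loop is written the same way in bash and zsh; what differs is the continuation prompt shown while you type the loop. (Other parts of the two languages, such as arrays, do differ.)

$ for x in *.csv # bash
> do
> echo $x
> done
$ for x in *.csv # zsh
for> do
for> echo $x
for> done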

You can use loops to (for example) make backup copies of a bunch of files.

$ mkdir bkup
$ for x in *.csv
> do
> cp "$x" "bkup/$x" # quotes protect file names that contain spaces
> done
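Loops combine naturally with the filters from the previous section. As a sketch, the loop below reports the line count of each csv file in a directory and keeps a running total; the two files it creates (small.csv, large.csv) are placeholders.

```shell
tmp=$(mktemp -d)
cd "$tmp"
printf 'a,b\n1,2\n' > small.csv
printf 'a,b\n1,2\n3,4\n' > large.csv

total=0
for f in *.csv
do
    n=$(wc -l < "$f")        # bare line count for this file
    echo "$f: $n lines"
    total=$((total + n))     # $(( ... )) does integer arithmetic
done
echo "total: $total"
```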

Scripts

A shell script is a file containing a sequence of shell commands. When an interactive shell starts, it typically runs a startup script in your home directory: .zshrc for zsh, .bashrc for bash. This is where you can put commands to customize your shell.
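As a minimal sketch of what a script looks like, the example below writes a tiny script to a file and then runs it. The names count_lines.sh and sample.txt are invented for illustration.

```shell
tmp=$(mktemp -d)
cd "$tmp"
printf 'one\ntwo\n' > sample.txt

# Write a short script to a file. The first line (the "shebang")
# says which shell should interpret it; $1 is the first argument.
cat > count_lines.sh <<'EOF'
#!/bin/sh
# Print the number of lines in the file named by the first argument.
wc -l < "$1"
EOF

chmod +x count_lines.sh                  # mark it executable, so ./count_lines.sh also works
result=$(sh count_lines.sh sample.txt)   # run it by handing the file to sh
echo "$result"
```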

The environment

Every shell has an environment which is a bunch of variables that are used by programs. Elements of the environment are called environment variables. Sometimes you have to set an environment variable.

Some common and important shell environment variables are:

$ echo $HOME # home directory
$ echo $PATH # search path for commands
$ echo $USER # your user id
$ echo $PS1 # the shell prompt
$ env # print the entire environment
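Setting an environment variable usually means using export. A plain shell variable stays private to the current shell, while an exported one is inherited by the programs the shell launches. The sketch below checks this by asking a child sh process what it can see; the variable name greeting is arbitrary.

```shell
# A plain shell variable is visible only in the current shell.
greeting="hello"
seen_before=$(sh -c 'echo "${greeting:-unset}"')

# export puts it into the environment, so child processes inherit it.
export greeting
seen_after=$(sh -c 'echo "${greeting:-unset}"')

echo "before export: $seen_before"
echo "after export:  $seen_after"
```

Lines such as export PATH="$HOME/bin:$PATH" in a startup script work the same way.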

Other topics

  • Using wget to get information from the web.
  • Using ssh to make a remote connection.
  • Using rsync to copy files across computers.
  • Using tar or gzip to compress and uncompress archives of files.
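Of these, tar is the easiest to try without a network connection. As a rough sketch (assuming a tar that supports the -z option for gzip compression, as GNU and BSD tar do), here is a round trip on a throwaway directory:

```shell
tmp=$(mktemp -d)
cd "$tmp"
mkdir data
echo "a,b" > data/training.csv

tar -czf data.tar.gz data           # c = create, z = gzip-compress, f = archive file name
tar -tzf data.tar.gz                # t = list the contents without unpacking

mkdir restored
tar -xzf data.tar.gz -C restored    # x = extract, -C = unpack into this directory
cat restored/data/training.csv
```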