Course Notes

Jupyter Lab Overview

Start Jupyter Lab

  • client server architecture
  • browser connects to server
  • kernels execute commands
  • different kernels available
  • terminals, text editor, notebook

You can arrange the panels however you like.

In [4]:
pwd
Out[4]:
'/home/jet08013/GitHub/Carpentries/2020-01-13/notes'

A notebook blends text (Markdown) and code (Python, in our case).

  • click in a box to activate it
  • esc gives you access to the notebook commands (menu, etc)
  • m makes a box markdown, y makes it code (or use the menu at the top)

Simple Markdown

  • for lists
  • lists
  • list items with a hyphen

emphasis bold

links

LaTeX

For those who know TeX, the notebook can render math:

$$\sum_{n=1}^{\infty}\frac{1}{n^2} = \frac{\pi^2}{6}$$

Python

Python is an interpreted language and the blocks of the notebook can be used as a calculator (a fancy one!)

Variables

In [9]:
name = "Jeremy"
age = 60
probability = .35
In [10]:
print(name, age, probability)
Jeremy 60 0.35
In [11]:
print(name, 'is', age,'years old with probability',probability)
Jeremy is 60 years old with probability 0.35

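Variables must be created before being used, by having a value assigned to them. In the cells below, temperature had already been assigned by an earlier run (In [7]), so In [12] displays its value; in a fresh session, that cell would raise a NameError.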
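As a minimal sketch of that rule, the cell below uses a name before assigning it and catches the resulting error. The variable name humidity is hypothetical and is not part of the course notes.

```python
# 'humidity' is a made-up name used only for this illustration.
try:
    print(humidity)              # not assigned yet
except NameError as err:
    error_message = str(err)
    print("NameError:", error_message)

humidity = 40                    # now the name exists
doubled = humidity * 2
print(doubled)
```

Once the assignment has run, the name stays defined for the rest of the session, which is why the order in which cells are run matters.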

In [12]:
temperature
Out[12]:
50
In [7]:
temperature=50
In [13]:
temperature*2
Out[13]:
100
In [14]:
print("the temperature is:", temperature)
the temperature is: 50
In [15]:
full_name = "Jeremy Teitelbaum"
In [16]:
full_name[3]
Out[16]:
'e'

In Python, indexing starts from zero, so full_name[3] is the fourth character and full_name[0] is the first.

In [17]:
full_name[0]
Out[17]:
'J'
In [18]:
len(full_name)
Out[18]:
17
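Because indexing starts at zero, the last valid index is one less than the length. The sketch below illustrates this with the same string; it is only an illustration, not part of the original session.

```python
full_name = "Jeremy Teitelbaum"
n = len(full_name)               # number of characters
first = full_name[0]             # first character
last = full_name[n - 1]          # last character, same as full_name[-1]
print(first, last)

try:
    full_name[n]                 # one past the end
except IndexError as err:
    print("IndexError:", err)
```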

Types

  • strings
  • integers
  • floats
  • lists
In [25]:
a=123
b='123'
In [26]:
a[1]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-26-8bc71255a22e> in <module>
----> 1 a[1]

TypeError: 'int' object is not subscriptable
In [27]:
b[1]
Out[27]:
'2'
In [512]:
L = ['jeremy', 'kendra', 'pariksheet']
In [513]:
L[0]
Out[513]:
'jeremy'

Slicing

In [19]:
full_name[1:3]
Out[19]:
'er'
In [20]:
full_name[5:]
Out[20]:
'y Teitelbaum'
In [21]:
full_name[:5]
Out[21]:
'Jerem'
In [22]:
full_name[3:5:3]
Out[22]:
'e'
In [23]:
full_name[:-1]
Out[23]:
'Jeremy Teitelbau'
In [24]:
full_name[-1]
Out[24]:
'm'
In [28]:
full_name[-1::-1]
Out[28]:
'muabletieT ymereJ'
In [514]:
L = ['a','b','c','d']
In [515]:
L[1:3]
Out[515]:
['b', 'c']
In [516]:
L[-1]
Out[516]:
'd'
In [517]:
L[::2]
Out[517]:
['a', 'c']
In [29]:
magic_word = "thanos_must_die"
  • What command will determine how many letters are in the magic_word?
  • How would you print the fifth through eighth letters of the magic word (inclusive)?
In [31]:
print(magic_word[4:8])
print(magic_word[7:])
print(magic_word[:3])
print(magic_word[:])
print(magic_word[2:-3])
print(magic_word[-3:2:-1])
print(magic_word[0:100])
os_m
must_die
tha
thanos_must_die
anos_must_
d_tsum_son
thanos_must_die
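One possible answer to the two questions above, together with a sketch of what a slice [start:stop:step] selects. The helper variable names are illustrative only.

```python
magic_word = "thanos_must_die"

# Question 1: len() counts the characters.
n_letters = len(magic_word)

# Question 2: letters five through eight sit at indices 4, 5, 6, 7.
fifth_to_eighth = magic_word[4:8]

# For a positive step, a slice picks the same characters as this loop:
start, stop, step = 4, 8, 1
by_hand = "".join(magic_word[i] for i in range(start, stop, step))

print(n_letters, fifth_to_eighth, by_hand)
```

The stop index is excluded, which is why the slice ends at 8 rather than 7.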

Types

In Python, every value has a type. You do not declare it: Python determines the type from the value you assign.

The most important types are:

  • strings
  • integers
  • floating point
In [32]:
x="this is a string"
a=137.50
b=12e-1
c=133
In [33]:
print(a)
137.5
In [34]:
print(b)
1.2
In [35]:
print(c)
133
In [37]:
type(a)
Out[37]:
float
In [38]:
type(b)
Out[38]:
float
In [39]:
type(c)
Out[39]:
int
In [40]:
type(x)
Out[40]:
str

Operations on numbers and strings

They're what you'd expect, with one caveat: floats are stored in binary, so some results (z and w below) carry tiny rounding errors.

In [44]:
x=3.5
y=x/5
z=x*22.4
w=x*1e-1
print("x=",x,"y=",y,"z=",z,"w=",w)
x= 3.5 y= 0.7 z= 78.39999999999999 w= 0.35000000000000003
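The rounding visible in z and w is normal floating-point behavior. A minimal sketch of two common ways to cope with it, comparing within a tolerance and rounding for display:

```python
import math

z = 3.5 * 22.4
print(z)                         # shows the tiny rounding error

# Exact comparison of floats is fragile...
print(0.1 + 0.2 == 0.3)

# ...so compare within a tolerance instead.
print(math.isclose(z, 78.4))

# Round only when displaying a result.
print(round(z, 2))
```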

String addition is concatenation.

In [45]:
x="Jeremy"+"Teitelbaum"
In [46]:
print(x)
JeremyTeitelbaum

Mixing floats and integers is fine (you get a float), but adding a string to a number raises a TypeError.

In [50]:
1+3.5
Out[50]:
4.5
In [51]:
"Jeremy"+3
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-51-8b1195841f62> in <module>
----> 1 "Jeremy"+3

TypeError: can only concatenate str (not "int") to str

Adding a number directly to a string fails, but you can convert floats or ints to strings with str() and then combine them:

In [55]:
"Jeremy Teitelbaum is "+60+" years old"
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-55-1d48ca8ed8ea> in <module>
----> 1 "Jeremy Teitelbaum is "+60+" years old"

TypeError: can only concatenate str (not "int") to str
In [56]:
"Jeremy Teitelbaum is "+str(60)+" years old"
Out[56]:
'Jeremy Teitelbaum is 60 years old'
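An alternative to str() and + is an f-string (available in Python 3.6 and later), which converts the values for you. This is a sketch of an equivalent approach, not something used in the original session.

```python
name = "Jeremy Teitelbaum"
age = 60

with_str = name + " is " + str(age) + " years old"
with_fstring = f"{name} is {age} years old"   # values in braces are converted

print(with_fstring)
```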
In [58]:
3*"a"
Out[58]:
'aaa'
In [59]:
type("hello")
Out[59]:
str
In [60]:
type(60)
Out[60]:
int
In [61]:
type(3.6)
Out[61]:
float
In [62]:
len("Jeremy")
Out[62]:
6
In [63]:
len(3.5)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-63-5c30567bf151> in <module>
----> 1 len(3.5)

TypeError: object of type 'float' has no len()

Operations on lists

In [518]:
L = ['jeremy','kendra', 'pariksheet','dyanna']
In [519]:
S = ['mouse', 'cat']
In [520]:
L+S
Out[520]:
['jeremy', 'kendra', 'pariksheet', 'dyanna', 'mouse', 'cat']
In [521]:
sorted(L)
Out[521]:
['dyanna', 'jeremy', 'kendra', 'pariksheet']
In [526]:
['a']*3
Out[526]:
['a', 'a', 'a']
In [525]:
L = L + ['x']
print(L)
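['jeremy', 'kendra', 'pariksheet', 'dyanna', 'x', 'x', 'x']

The three 'x' items show that this cell was run more than once: each run of L = L + ['x'] appends one more.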
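A small sketch of the difference between operations that build a new list and ones that change a list in place. The list contents here are shortened for illustration.

```python
L = ['jeremy', 'kendra']

M = L + ['x']            # builds a new list; L itself is unchanged
print(L, M)

L.append('x')            # changes L in place
L.append('x')            # running it again adds another item
print(L)

names = ['kendra', 'jeremy']
print(sorted(names))     # new sorted list; names keeps its order
print(names)
```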

strings to lists and back

In [549]:
list('abcdefg')
Out[549]:
['a', 'b', 'c', 'd', 'e', 'f', 'g']
In [551]:
''.join(['a','b','c'])
Out[551]:
'abc'

Dictionaries

In [283]:
f = {}
f['jeremy']=15
f['kendra']=33
f['marshmallow']=100
f['french_fries']='hello'
In [284]:
f['jeremy']
Out[284]:
15
In [285]:
f['french_fries']
Out[285]:
'hello'
In [286]:
f[0]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-286-d8711d29ed7d> in <module>
----> 1 f[0]

KeyError: 0
In [287]:
f.keys()
Out[287]:
dict_keys(['jeremy', 'kendra', 'marshmallow', 'french_fries'])
In [288]:
f.values()
Out[288]:
dict_values([15, 33, 100, 'hello'])
In [289]:
a = dict(name=['jeremy','kendra','pariksheet'],gender=['m','f','m'])
In [279]:
a
Out[279]:
{'name': ['jeremy', 'kendra', 'pariksheet'], 'gender': ['m', 'f', 'm']}
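Dictionaries are looked up by key, not by position, which is why f[0] failed above. A minimal sketch of checking for a key first, or asking for a default instead of an error:

```python
f = {'jeremy': 15, 'kendra': 33}

print('jeremy' in f)             # membership test on the keys
print(f.get(0))                  # None instead of a KeyError
print(f.get(0, 'missing'))       # or a default of your choosing

for key, value in f.items():     # loop over key/value pairs
    print(key, value)
```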

Comments

In [64]:
# this goes in a code cell, but doesn't get executed

Built-in Functions

A function call such as f(x, y, z) returns a value. Some useful built-in functions:

  • len
  • int, str, float
  • print
  • max
  • min
  • round
  • help (call help() on a function, or press Shift-Tab)
In [66]:
round(3.12131,3)
Out[66]:
3.121
In [71]:
max('hello')
Out[71]:
'o'
In [72]:
min("hello")
Out[72]:
'e'
In [73]:
round("hello")
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-73-06fcaad98a98> in <module>
----> 1 round("hello")

TypeError: type str doesn't define __round__ method
In [74]:
1/2
Out[74]:
0.5
In [75]:
result = 60
In [77]:
report = "the result is "+str(result)
In [78]:
print(report)
the result is 60
In [465]:
#help(type)
In [80]:
help(round)
Help on built-in function round in module builtins:

round(number, ndigits=None)
    Round a number to a given precision in decimal digits.
    
    The return value is an integer if ndigits is omitted or None.  Otherwise
    the return value has the same type as the number.  ndigits may be negative.

In [81]:
round(123,-2)
Out[81]:
100
In [82]:
round(1351,-1)
Out[82]:
1350
In [466]:
#help()
In [467]:
report = "Now is the time to flee'
  File "<ipython-input-467-51897dfb7c04>", line 1
    report = "Now is the time to flee'
                                      ^
SyntaxError: EOL while scanning string literal
In [468]:
1/0
---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
<ipython-input-468-9e1622b385b6> in <module>
----> 1 1/0

ZeroDivisionError: division by zero
In [469]:
y = 15-23
x = y+8
z= 3/x
---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
<ipython-input-469-23b9d1a37b92> in <module>
      1 y = 15-23
      2 x = y+8
----> 3 z= 3/x

ZeroDivisionError: division by zero
In [470]:
max()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-470-f870ea12a3fc> in <module>
----> 1 max()

TypeError: max expected 1 arguments, got 0
In [471]:
min()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-471-9c9b7bdfea4e> in <module>
----> 1 min()

TypeError: min expected 1 arguments, got 0

Libraries

Python is a very simple language.

  • There were only 33 "keywords" in the Python language through version 3.6 (Python 3.7 and later also reserve async and await):
    False, class, finally, is, return, None, continue, for, lambda, try, True, def, from, nonlocal, while, and, del, global, not, with, as, elif, if, or, yield, assert, else, import, pass, break, except, in, raise
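The exact number of keywords depends on the Python version. A quick way to check your own installation is the standard keyword module:

```python
import keyword

print(len(keyword.kwlist))           # count for the running Python version
print(keyword.kwlist)                # the full list
print(keyword.iskeyword("for"))      # a keyword
print(keyword.iskeyword("print"))    # a built-in function, not a keyword
```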
    

All of the interesting capabilities of the language come from extensions to the basic language; these extensions are called "libraries" or "modules".

The language also includes about 70 built-in functions; int, len, print, str, type, max, and min are among them. The built-in functions are fairly primitive.

To get access to more interesting functions, you need to import the library that defines them. Many interesting capabilities are added by the standard library, which ships with Python.

Two important third-party libraries

  • numpy
  • pandas
In [529]:
import numpy
import pandas
In [531]:
print('pi is ',numpy.pi)
pi is  3.141592653589793
In [532]:
print('e is ',numpy.exp(1))
e is  2.718281828459045

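Notice that we refer to the function exp, for example, by first naming the library it came from. The bare exp(1) and math.exp(1) calls below succeed only because exp and math had already been imported earlier in this session; in a fresh session, each would raise a NameError.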

In [533]:
exp(1)
Out[533]:
2.718281828459045
In [94]:
math.exp(1)
Out[94]:
2.718281828459045

You can get help on a library/module using the help command.

The import command also allows you to adopt abbreviations.

In [534]:
import numpy as np
In [535]:
np.exp(1)
Out[535]:
2.718281828459045
In [536]:
from numpy import exp
In [537]:
exp(1)
Out[537]:
2.718281828459045
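The three import forms shown above can be compared side by side. This sketch uses the standard-library math module rather than numpy so that it runs without extra installs; the alias m is an arbitrary choice.

```python
import math                  # refer to names as math.exp
import math as m             # same module under a shorter alias
from math import exp         # bring one name in directly

print(math.exp(1), m.exp(1), exp(1))
print(m is math)             # the alias points at the same module
```

Importing the whole module keeps it clear where each function comes from; from ... import is shorter but can hide the origin of a name.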
In [538]:
import numpy.random
In [543]:
numpy.random.choice([1,2,3,4,5])
Out[543]:
3
In [554]:
np.random.choice(list('ACTTGCTTGAC'))
Out[554]:
'T'
In [556]:
import math 
import random
bases = "ACTTGCTTGAC" 
n_bases = len(bases)
idx = np.random.randint(n_bases)
print("random base", bases[idx], "base index", idx)
random base C base index 10
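The same random pick can be sketched with the standard-library random module, which the cell above imports but does not use. The fixed seed is an assumption added here so that reruns give the same picks.

```python
import random

random.seed(0)                        # fixed seed for repeatable results
bases = "ACTTGCTTGAC"

base = random.choice(bases)           # pick one character directly
idx = random.randrange(len(bases))    # or pick an index, then look it up
print("random base", base, "| base at index", idx, "is", bases[idx])
```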
  • Data frames are "like" spreadsheets
    • Columns
    • Index
  • manipulation is through "methods"
In [557]:
import numpy.random as rnd
In [560]:
rnd.randint(20)
Out[560]:
11

Numpy arrays

In [700]:
a=np.arange(0,10,.1)
In [702]:
numpy.cos(a)
Out[702]:
array([ 1.        ,  0.99500417,  0.98006658,  0.95533649,  0.92106099,
        0.87758256,  0.82533561,  0.76484219,  0.69670671,  0.62160997,
        0.54030231,  0.45359612,  0.36235775,  0.26749883,  0.16996714,
        0.0707372 , -0.02919952, -0.12884449, -0.22720209, -0.32328957,
       -0.41614684, -0.5048461 , -0.58850112, -0.66627602, -0.73739372,
       -0.80114362, -0.85688875, -0.90407214, -0.94222234, -0.97095817,
       -0.9899925 , -0.99913515, -0.99829478, -0.98747977, -0.96679819,
       -0.93645669, -0.89675842, -0.84810003, -0.79096771, -0.7259323 ,
       -0.65364362, -0.57482395, -0.49026082, -0.40079917, -0.30733287,
       -0.2107958 , -0.11215253, -0.01238866,  0.08749898,  0.18651237,
        0.28366219,  0.37797774,  0.46851667,  0.55437434,  0.63469288,
        0.70866977,  0.77556588,  0.83471278,  0.88551952,  0.92747843,
        0.96017029,  0.98326844,  0.9965421 ,  0.99985864,  0.99318492,
        0.97658763,  0.95023259,  0.91438315,  0.86939749,  0.8157251 ,
        0.75390225,  0.68454667,  0.60835131,  0.52607752,  0.43854733,
        0.34663532,  0.25125984,  0.15337386,  0.05395542, -0.04600213,
       -0.14550003, -0.24354415, -0.33915486, -0.43137684, -0.51928865,
       -0.6020119 , -0.67872005, -0.74864665, -0.81109301, -0.86543521,
       -0.91113026, -0.9477216 , -0.97484362, -0.99222533, -0.99969304,
       -0.99717216, -0.98468786, -0.96236488, -0.93042627, -0.88919115])
In [703]:
3*a
Out[703]:
array([ 0. ,  0.3,  0.6,  0.9,  1.2,  1.5,  1.8,  2.1,  2.4,  2.7,  3. ,
        3.3,  3.6,  3.9,  4.2,  4.5,  4.8,  5.1,  5.4,  5.7,  6. ,  6.3,
        6.6,  6.9,  7.2,  7.5,  7.8,  8.1,  8.4,  8.7,  9. ,  9.3,  9.6,
        9.9, 10.2, 10.5, 10.8, 11.1, 11.4, 11.7, 12. , 12.3, 12.6, 12.9,
       13.2, 13.5, 13.8, 14.1, 14.4, 14.7, 15. , 15.3, 15.6, 15.9, 16.2,
       16.5, 16.8, 17.1, 17.4, 17.7, 18. , 18.3, 18.6, 18.9, 19.2, 19.5,
       19.8, 20.1, 20.4, 20.7, 21. , 21.3, 21.6, 21.9, 22.2, 22.5, 22.8,
       23.1, 23.4, 23.7, 24. , 24.3, 24.6, 24.9, 25.2, 25.5, 25.8, 26.1,
       26.4, 26.7, 27. , 27.3, 27.6, 27.9, 28.2, 28.5, 28.8, 29.1, 29.4,
       29.7])
In [704]:
a*a
Out[704]:
array([0.000e+00, 1.000e-02, 4.000e-02, 9.000e-02, 1.600e-01, 2.500e-01,
       3.600e-01, 4.900e-01, 6.400e-01, 8.100e-01, 1.000e+00, 1.210e+00,
       1.440e+00, 1.690e+00, 1.960e+00, 2.250e+00, 2.560e+00, 2.890e+00,
       3.240e+00, 3.610e+00, 4.000e+00, 4.410e+00, 4.840e+00, 5.290e+00,
       5.760e+00, 6.250e+00, 6.760e+00, 7.290e+00, 7.840e+00, 8.410e+00,
       9.000e+00, 9.610e+00, 1.024e+01, 1.089e+01, 1.156e+01, 1.225e+01,
       1.296e+01, 1.369e+01, 1.444e+01, 1.521e+01, 1.600e+01, 1.681e+01,
       1.764e+01, 1.849e+01, 1.936e+01, 2.025e+01, 2.116e+01, 2.209e+01,
       2.304e+01, 2.401e+01, 2.500e+01, 2.601e+01, 2.704e+01, 2.809e+01,
       2.916e+01, 3.025e+01, 3.136e+01, 3.249e+01, 3.364e+01, 3.481e+01,
       3.600e+01, 3.721e+01, 3.844e+01, 3.969e+01, 4.096e+01, 4.225e+01,
       4.356e+01, 4.489e+01, 4.624e+01, 4.761e+01, 4.900e+01, 5.041e+01,
       5.184e+01, 5.329e+01, 5.476e+01, 5.625e+01, 5.776e+01, 5.929e+01,
       6.084e+01, 6.241e+01, 6.400e+01, 6.561e+01, 6.724e+01, 6.889e+01,
       7.056e+01, 7.225e+01, 7.396e+01, 7.569e+01, 7.744e+01, 7.921e+01,
       8.100e+01, 8.281e+01, 8.464e+01, 8.649e+01, 8.836e+01, 9.025e+01,
       9.216e+01, 9.409e+01, 9.604e+01, 9.801e+01])
In [705]:
a+numpy.exp(a)
Out[705]:
array([1.00000000e+00, 1.20517092e+00, 1.42140276e+00, 1.64985881e+00,
       1.89182470e+00, 2.14872127e+00, 2.42211880e+00, 2.71375271e+00,
       3.02554093e+00, 3.35960311e+00, 3.71828183e+00, 4.10416602e+00,
       4.52011692e+00, 4.96929667e+00, 5.45519997e+00, 5.98168907e+00,
       6.55303242e+00, 7.17394739e+00, 7.84964746e+00, 8.58589444e+00,
       9.38905610e+00, 1.02661699e+01, 1.12250135e+01, 1.22741825e+01,
       1.34231764e+01, 1.46824940e+01, 1.60637380e+01, 1.75797317e+01,
       1.92446468e+01, 2.10741454e+01, 2.30855369e+01, 2.52979513e+01,
       2.77325302e+01, 3.04126389e+01, 3.33641000e+01, 3.66154520e+01,
       4.01982344e+01, 4.41473044e+01, 4.85011845e+01, 5.33024491e+01,
       5.85981500e+01, 6.44402876e+01, 7.08863310e+01, 7.79997937e+01,
       8.58508687e+01, 9.45171313e+01, 1.04084316e+02, 1.14647172e+02,
       1.26310418e+02, 1.39189780e+02, 1.53413159e+02, 1.69121907e+02,
       1.86472242e+02, 2.05636810e+02, 2.26806416e+02, 2.50191932e+02,
       2.76026407e+02, 3.04567401e+02, 3.36099560e+02, 3.70937468e+02,
       4.09428793e+02, 4.51957770e+02, 4.98949041e+02, 5.50871910e+02,
       6.08245038e+02, 6.71641633e+02, 7.41695189e+02, 8.19105825e+02,
       9.04647292e+02, 9.99174716e+02, 1.10363316e+03, 1.21906707e+03,
       1.34663076e+03, 1.48759993e+03, 1.64338443e+03, 1.81554241e+03,
       2.00579590e+03, 2.21604799e+03, 2.44840198e+03, 2.70518233e+03,
       2.98895799e+03, 3.30256808e+03, 3.64915031e+03, 4.03217239e+03,
       4.45546675e+03, 4.92326884e+03, 5.44025959e+03, 6.01161222e+03,
       6.64304401e+03, 7.34087354e+03, 8.11208393e+03, 8.96439270e+03,
       9.90632906e+03, 1.09473192e+04, 1.20977807e+04, 1.33692268e+04,
       1.47743816e+04, 1.63273072e+04, 1.80435449e+04, 1.99402704e+04])
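The outputs above show that arithmetic on a numpy array works element by element. A short sketch contrasting that with a plain Python list, using a three-element array for readability:

```python
import numpy as np

values = [1, 2, 3]
arr = np.array(values)

print(3 * values)        # list: repetition
print(3 * arr)           # array: each element multiplied by 3
print(arr * arr)         # array: elementwise product
print(np.cos(arr))       # function applied to every element
```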

DataFrames and Pandas

In [108]:
import pandas as pd

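simple dataframes and dictionaries

Here a is the dictionary of names and genders built in the Dictionaries section, not the numpy array that reuses the name a.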

In [280]:
df = pd.DataFrame.from_dict(a)
In [281]:
df
Out[281]:
         name  gender
0      jeremy       m
1      kendra       f
2  pariksheet       m
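A self-contained sketch of the same construction: each dictionary key becomes a column and each list supplies that column's rows. Passing the dictionary straight to pd.DataFrame is used here as an equivalent of from_dict for this simple case.

```python
import pandas as pd

a = dict(name=['jeremy', 'kendra', 'pariksheet'], gender=['m', 'f', 'm'])
df = pd.DataFrame(a)             # keys become columns, list items become rows

print(df.shape)                  # (rows, columns)
print(list(df.columns))
print(df.loc[1, 'name'])         # row label 1, column 'name'
```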
In [562]:
data = pd.read_csv('../gapminder_data.csv')
In [563]:
data.columns
Out[563]:
Index(['country', 'year', 'pop', 'continent', 'lifeExp', 'gdpPercap'], dtype='object')
In [564]:
data.head()
Out[564]:
       country  year         pop  continent  lifeExp   gdpPercap
0  Afghanistan  1952   8425333.0       Asia   28.801  779.445314
1  Afghanistan  1957   9240934.0       Asia   30.332  820.853030
2  Afghanistan  1962  10267083.0       Asia   31.997  853.100710
3  Afghanistan  1967  11537966.0       Asia   34.020  836.197138
4  Afghanistan  1972  13079460.0       Asia   36.088  739.981106
In [565]:
data = pd.read_csv('../gapminder_data.csv', index_col='country')
In [566]:
data.columns
Out[566]:
Index(['year', 'pop', 'continent', 'lifeExp', 'gdpPercap'], dtype='object')
In [567]:
data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1704 entries, Afghanistan to Zimbabwe
Data columns (total 5 columns):
year         1704 non-null int64
pop          1704 non-null float64
continent    1704 non-null object
lifeExp      1704 non-null float64
gdpPercap    1704 non-null float64
dtypes: float64(3), int64(1), object(1)
memory usage: 79.9+ KB

selecting elements from dataframes

columns

In [572]:
data['pop']
Out[572]:
country
Afghanistan     8425333.0
Afghanistan     9240934.0
Afghanistan    10267083.0
Afghanistan    11537966.0
Afghanistan    13079460.0
                  ...    
Zimbabwe        9216418.0
Zimbabwe       10704340.0
Zimbabwe       11404948.0
Zimbabwe       11926563.0
Zimbabwe       12311143.0
Name: pop, Length: 1704, dtype: float64
In [573]:
data[['year','pop']]
Out[573]:
             year         pop
country
Afghanistan  1952   8425333.0
Afghanistan  1957   9240934.0
Afghanistan  1962  10267083.0
Afghanistan  1967  11537966.0
Afghanistan  1972  13079460.0
...           ...         ...
Zimbabwe     1987   9216418.0
Zimbabwe     1992  10704340.0
Zimbabwe     1997  11404948.0
Zimbabwe     2002  11926563.0
Zimbabwe     2007  12311143.0

1704 rows × 2 columns

rows

from the index

In [574]:
data.loc['Afghanistan']
Out[574]:
             year         pop  continent  lifeExp   gdpPercap
country
Afghanistan  1952   8425333.0       Asia   28.801  779.445314
Afghanistan  1957   9240934.0       Asia   30.332  820.853030
Afghanistan  1962  10267083.0       Asia   31.997  853.100710
Afghanistan  1967  11537966.0       Asia   34.020  836.197138
Afghanistan  1972  13079460.0       Asia   36.088  739.981106
Afghanistan  1977  14880372.0       Asia   38.438  786.113360
Afghanistan  1982  12881816.0       Asia   39.854  978.011439
Afghanistan  1987  13867957.0       Asia   40.822  852.395945
Afghanistan  1992  16317921.0       Asia   41.674  649.341395
Afghanistan  1997  22227415.0       Asia   41.763  635.341351
Afghanistan  2002  25268405.0       Asia   42.129  726.734055
Afghanistan  2007  31889923.0       Asia   43.828  974.580338

boolean selection on a column

In [575]:
data['continent']=='Asia'
Out[575]:
country
Afghanistan     True
Afghanistan     True
Afghanistan     True
Afghanistan     True
Afghanistan     True
               ...  
Zimbabwe       False
Zimbabwe       False
Zimbabwe       False
Zimbabwe       False
Zimbabwe       False
Name: continent, Length: 1704, dtype: bool
In [576]:
data[data['continent']=='Asia']
Out[576]:
             year         pop  continent  lifeExp    gdpPercap
country
Afghanistan  1952   8425333.0       Asia   28.801   779.445314
Afghanistan  1957   9240934.0       Asia   30.332   820.853030
Afghanistan  1962  10267083.0       Asia   31.997   853.100710
Afghanistan  1967  11537966.0       Asia   34.020   836.197138
Afghanistan  1972  13079460.0       Asia   36.088   739.981106
...           ...         ...        ...      ...          ...
Yemen Rep.   1987  11219340.0       Asia   52.922  1971.741538
Yemen Rep.   1992  13367997.0       Asia   55.599  1879.496673
Yemen Rep.   1997  15826497.0       Asia   58.020  2117.484526
Yemen Rep.   2002  18701257.0       Asia   60.308  2234.820827
Yemen Rep.   2007  22211743.0       Asia   62.698  2280.769906

396 rows × 5 columns

In [618]:
data[data['pop']>1e8]
Out[618]:
               year          pop  continent  lifeExp     gdpPercap
country
Bangladesh     1987  103764241.0       Asia   52.819    751.979403
Bangladesh     1992  113704579.0       Asia   56.018    837.810164
Bangladesh     1997  123315288.0       Asia   59.412    972.770035
Bangladesh     2002  135656790.0       Asia   62.013   1136.390430
Bangladesh     2007  150448339.0       Asia   64.062   1391.253792
...             ...          ...        ...      ...           ...
United States  1987  242803533.0   Americas   75.020  29884.350410
United States  1992  256894189.0   Americas   76.090  32003.932240
United States  1997  272911760.0   Americas   76.810  35767.433030
United States  2002  287675526.0   Americas   77.310  39097.099550
United States  2007  301139947.0   Americas   78.242  42951.653090

77 rows × 5 columns

In [619]:
data[data['lifeExp']<60]
Out[619]:
             year         pop  continent  lifeExp   gdpPercap
country
Afghanistan  1952   8425333.0       Asia   28.801  779.445314
Afghanistan  1957   9240934.0       Asia   30.332  820.853030
Afghanistan  1962  10267083.0       Asia   31.997  853.100710
Afghanistan  1967  11537966.0       Asia   34.020  836.197138
Afghanistan  1972  13079460.0       Asia   36.088  739.981106
...           ...         ...        ...      ...         ...
Zimbabwe     1972   5861135.0     Africa   55.635  799.362176
Zimbabwe     1977   6642107.0     Africa   57.674  685.587682
Zimbabwe     1997  11404948.0     Africa   46.809  792.449960
Zimbabwe     2002  11926563.0     Africa   39.989  672.038623
Zimbabwe     2007  12311143.0     Africa   43.487  469.709298

827 rows × 5 columns
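Boolean conditions like the ones above can be combined with & (and) or | (or), as long as each condition is wrapped in parentheses. The sketch below uses a tiny frame built from four rows copied from the outputs above; the variable name small is illustrative.

```python
import pandas as pd

# Four rows copied from the gapminder outputs above.
small = pd.DataFrame(
    {
        'year': [1952, 2007, 2007, 2007],
        'continent': ['Asia', 'Asia', 'Africa', 'Americas'],
        'lifeExp': [28.801, 43.828, 43.487, 78.242],
    },
    index=['Afghanistan', 'Afghanistan', 'Zimbabwe', 'United States'],
)

# Each comparison gives a True/False column; & combines them row by row.
mask = (small['year'] == 2007) & (small['lifeExp'] < 60)
print(mask.tolist())
print(small[mask])       # keeps only the rows where mask is True
```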

numerical indexing

In [620]:
data.iloc[33]
Out[620]:
year               1997
pop          2.9072e+07
continent        Africa
lifeExp          69.152
gdpPercap        4797.3
Name: Algeria, dtype: object
In [621]:
data.iloc[33,3]
Out[621]:
69.152

rows and columns

In [577]:
data.loc['Afghanistan','pop']
Out[577]:
country
Afghanistan     8425333.0
Afghanistan     9240934.0
Afghanistan    10267083.0
Afghanistan    11537966.0
Afghanistan    13079460.0
Afghanistan    14880372.0
Afghanistan    12881816.0
Afghanistan    13867957.0
Afghanistan    16317921.0
Afghanistan    22227415.0
Afghanistan    25268405.0
Afghanistan    31889923.0
Name: pop, dtype: float64
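To summarize the two styles of selection: loc selects by label and iloc selects by position. A minimal sketch on three rows copied from the outputs above:

```python
import pandas as pd

small = pd.DataFrame(
    {'year': [1952, 1957, 1997], 'lifeExp': [28.801, 30.332, 69.152]},
    index=['Afghanistan', 'Afghanistan', 'Algeria'],
)

by_label = small.loc['Algeria', 'lifeExp']     # row label, column label
by_position = small.iloc[2, 1]                 # row position, column position
repeated = small.loc['Afghanistan', 'year']    # a repeated label gives a Series

print(by_label, by_position)
print(repeated.tolist())
```

Because the country index repeats, selecting by label can return several rows, while a position always refers to exactly one.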

statistics on numerical columns

In [579]:
data.loc['Afghanistan','pop'].describe()
Out[579]:
count    1.200000e+01
mean     1.582372e+07
std      7.114583e+06
min      8.425333e+06
25%      1.122025e+07
50%      1.347371e+07
75%      1.779529e+07
max      3.188992e+07
Name: pop, dtype: float64

Grouping: optional

In [625]:
s=data.groupby(['continent']).mean()
s.head()
Out[625]:
             year           pop    lifeExp     gdpPercap
continent
Africa     1979.5  9.916003e+06  48.865330   2193.754578
Americas   1979.5  2.450479e+07  64.658737   7136.110356
Asia       1979.5  7.703872e+07  60.064903   7902.150428
Europe     1979.5  1.716976e+07  71.903686  14469.475533
Oceania    1979.5  8.874672e+06  74.326208  18621.609223
In [632]:
averages_by_continent=data.groupby(['continent','year']).mean()
averages_by_continent.loc['Asia'].round()
Out[632]:
              pop  lifeExp  gdpPercap
year
1952   42283556.0     46.0     5195.0
1957   47356988.0     49.0     5788.0
1962   51404763.0     52.0     5729.0
1967   57747361.0     55.0     5971.0
1972   65180977.0     57.0     8187.0
1977   72257987.0     60.0     7791.0
1982   79095018.0     63.0     7434.0
1987   87006690.0     65.0     7608.0
1992   94948248.0     67.0     8640.0
1997  102523803.0     68.0     9834.0
2002  109145521.0     69.0    10174.0
2007  115513752.0     71.0    12473.0
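What groupby followed by mean computes can be sketched on a tiny frame: the rows are split by the grouping column and the mean is taken within each group. The four rows are copied from the outputs above; small and by_hand are illustrative names.

```python
import pandas as pd

small = pd.DataFrame({
    'continent': ['Asia', 'Asia', 'Africa', 'Africa'],
    'lifeExp': [28.801, 30.332, 39.989, 43.487],
})

# One mean per continent.
means = small.groupby('continent')['lifeExp'].mean()
print(means)

# The same number for one group, computed by filtering first.
by_hand = small[small['continent'] == 'Asia']['lifeExp'].mean()
print(by_hand)
```

If the frame also contains text columns, recent pandas versions may require selecting the numeric columns first (as here) or calling mean(numeric_only=True).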

Pivot table

In [652]:
lifeExp_over_time = pd.pivot_table(data,index='country',columns='year',values='lifeExp')
In [653]:
lifeExp_over_time.loc['Afghanistan',:]
Out[653]:
year
1952    28.801
1957    30.332
1962    31.997
1967    34.020
1972    36.088
1977    38.438
1982    39.854
1987    40.822
1992    41.674
1997    41.763
2002    42.129
2007    43.828
Name: Afghanistan, dtype: float64
In [656]:
stats_over_time = pd.pivot_table(data,index=['continent','country'],columns='year',values=['lifeExp','gdpPercap','pop'])
stats_over_time_by_continent=stats_over_time.groupby('continent').mean()
In [660]:
stats_over_time_by_continent['pop'].round(-3)
Out[660]:
year              1952         1957         1962         1967         1972         1977         1982         1987         1992         1997         2002         2007
continent
Africa       4570000.0    5093000.0    5702000.0    6448000.0    7305000.0    8328000.0    9603000.0   11055000.0   12675000.0   14304000.0   16033000.0   17876000.0
Americas    13806000.0   15478000.0   17331000.0   19230000.0   21175000.0   23123000.0   25212000.0   27310000.0   29571000.0   31876000.0   33991000.0   35955000.0
Asia        42284000.0   47357000.0   51405000.0   57747000.0   65181000.0   72258000.0   79095000.0   87007000.0   94948000.0  102524000.0  109146000.0  115514000.0
Europe      13937000.0   14596000.0   15345000.0   16039000.0   16688000.0   17239000.0   17709000.0   18103000.0   18605000.0   18965000.0   19274000.0   19537000.0
Oceania      5343000.0    5971000.0    6642000.0    7300000.0    8053000.0    8620000.0    9197000.0    9787000.0   10460000.0   11121000.0   11727000.0   12275000.0
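A pivot table reshapes "long" data (one row per country and year) into "wide" data (one row per country, one column per year). A minimal sketch with four values copied from the life expectancy table; unlike the session above, country is an ordinary column here rather than the index.

```python
import pandas as pd

long_form = pd.DataFrame({
    'country': ['Afghanistan', 'Afghanistan', 'Algeria', 'Algeria'],
    'year': [1952, 1957, 1952, 1957],
    'lifeExp': [28.801, 30.332, 43.077, 45.685],
})

# Rows come from 'country', columns from 'year', cells from 'lifeExp'.
wide = pd.pivot_table(long_form, index='country', columns='year', values='lifeExp')
print(wide)
print(wide.loc['Algeria', 1957])
```

When several rows share the same country and year, pivot_table averages them by default.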

Plotting

The plotting library used here is matplotlib:

In [662]:
import matplotlib.pyplot as plt
plt.style.use('ggplot')
In [672]:
plt.plot([1,2,3],[1,4,9],c='blue',linewidth=3,linestyle='solid')
plt.xlabel('x axis')
plt.ylabel('y axis')
plt.suptitle('A demo plot')
Out[672]:
Text(0.5, 0.98, 'A demo plot')
In [675]:
plt.plot([1,2,3],[1,4,9],c='orange',linewidth=1,linestyle='dashed')
plt.xlabel('x axis')
plt.ylabel('y axis')
plt.suptitle('A demo plot')
Out[675]:
Text(0.5, 0.98, 'A demo plot')
In [699]:
plt.scatter([1,2,3],[1,4,9],c=[0,1,2],s=100)
plt.plot([1,2,3],[1,4,9])
plt.xticks([1,2,3])
plt.yticks([1,4,9])
plt.xlim(0,5)
plt.ylim(0,10)
plt.grid(False)
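The plt.* calls above act on an implicit "current" figure. An equivalent sketch with explicit figure and axes objects is shown below; the Agg backend line is an assumption added so the code also runs outside a notebook, and can be dropped inside Jupyter.

```python
import matplotlib
matplotlib.use("Agg")            # draw off-screen (assumption: no display available)
import matplotlib.pyplot as plt

fig, ax = plt.subplots()         # one figure containing one set of axes
ax.plot([1, 2, 3], [1, 4, 9], c='blue', linewidth=3, linestyle='solid')
ax.set_xlabel('x axis')
ax.set_ylabel('y axis')
ax.set_xlim(0, 5)
ax.set_ylim(0, 10)
fig.suptitle('A demo plot')
```

Holding on to fig and ax makes it easier to adjust one plot when several are open at once.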

Plotting from pandas

  • by default, the index is the x axis and the values in each column are plotted against it
  • or you can specify x and y explicitly
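Both options can be sketched on a small frame built from values in the life expectancy table. The Agg backend line is an assumption for running outside a notebook, and the names life and long_form are illustrative.

```python
import matplotlib
matplotlib.use("Agg")            # draw off-screen (assumption: no display available)
import pandas as pd

life = pd.DataFrame(
    {'Afghanistan': [28.801, 30.332, 31.997], 'Algeria': [43.077, 45.685, 48.303]},
    index=pd.Index([1952, 1957, 1962], name='year'),
)

# Default: the index (year) is the x axis, one line per column.
ax = life.plot()
ax.set_ylabel('lifeExp')

# Explicit: turn the index into a column and name x and y yourself.
long_form = life.reset_index()
ax2 = long_form.plot(x='year', y='Algeria')
```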
In [706]:
data.head()
Out[706]:
             year         pop  continent  lifeExp   gdpPercap
country
Afghanistan  1952   8425333.0       Asia   28.801  779.445314
Afghanistan  1957   9240934.0       Asia   30.332  820.853030
Afghanistan  1962  10267083.0       Asia   31.997  853.100710
Afghanistan  1967  11537966.0       Asia   34.020  836.197138
Afghanistan  1972  13079460.0       Asia   36.088  739.981106
In [707]:
lifeExp_over_time.head()
Out[707]:
year           1952    1957    1962    1967    1972    1977    1982    1987    1992    1997    2002    2007
country
Afghanistan  28.801  30.332  31.997  34.020  36.088  38.438  39.854  40.822  41.674  41.763  42.129  43.828
Albania      55.230  59.280  64.820  66.220  67.690  68.930  70.420  72.000  71.581  72.950  75.651  76.423
Algeria      43.077  45.685  48.303  51.407  54.518  58.014  61.368  65.799  67.744  69.152  70.994  72.301
Angola       30.015  31.999  34.000  35.985  37.928  39.483  39.942  39.906  40.647  40.963  41.003  42.731
Argentina    62.485  64.399  65.142  65.634  67.065  68.481  69.942  70.774  71.868  73.275  74.340  75.320
In [737]:
lifeExp_over_time.T['Afghanistan'].plot()
Out[737]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a24217a50>
In [713]:
transpose = lifeExp_over_time.T
transpose[['Afghanistan','Germany']].plot(kind='bar')
Out[713]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a215b9b50>
In [753]:
_=stats_over_time.groupby('continent').mean()['gdpPercap'].T.plot()
In [751]:
stats_over_time.groupby('continent').mean()['lifeExp'].T.plot()
_=plt.suptitle('Mean Life Expectancy over Time')
In [754]:
_=stats_over_time['gdpPercap'].loc[['Africa']].mean().T.plot(kind='bar',legend=None)
In [774]:
data.plot(kind='scatter',y='gdpPercap',x='lifeExp',c='year',logy=True,figsize=(10,10))
plt.savefig('scatter.png')
In [763]:
pwd
Out[763]:
'/Users/swc/2020-01-13/Course Notes'
In [783]:
_=data[data['year']==2002].groupby('continent').sum()['pop'].plot(kind='pie',figsize=(10,10),label='Population')
In [805]:
_=data[(data['year']==2002) & (data['continent']=='Asia')].groupby('country')['pop'].sum().sort_values().plot(kind='bar',figsize=(10,10))