A few more notes on lists

methods vs functions

  • methods (and the del statement) act on a list in place; append and sort return None
    • append: L.append(x)
    • del: del L[3]
    • sort: L.sort()
  • functions and operators create a new list, leaving the original one alone
    • S = L + [x]
    • Y = sorted(L)

In [32]:
primes = [2,3,5,7,11,13,17,19]
new = primes.append(23)
print('returned value = ',new,'original = ',primes)
returned value =  None original =  [2, 3, 5, 7, 11, 13, 17, 19, 23]
In [31]:
primes = [2,3,5,7,11,13,17,19]
new = primes + [23]
print('returned value = ',new,'original = ',primes)
returned value =  [2, 3, 5, 7, 11, 13, 17, 19, 23] original =  [2, 3, 5, 7, 11, 13, 17, 19]
In [4]:
del primes[3]
In [5]:
primes
Out[5]:
[2, 3, 5, 11, 13, 17, 19, 23]
In [41]:
primes = [2,3,5,7,11,13,17,19]
new = primes.sort(reverse=True)
print('returned value = ',new,'original = ',primes)
returned value =  None original =  [19, 17, 13, 11, 7, 5, 3, 2]
In [38]:
primes = [2,3,5,7,11,13,17,19]
new= sorted(primes,reverse=True)
print('returned value = ',new,'original = ',primes)
returned value =  [19, 17, 13, 11, 7, 5, 3, 2] original =  [2, 3, 5, 7, 11, 13, 17, 19]

copying

  • if L is a list, then setting S=L creates a new NAME (S) for L, but not a new object
  • because S and L are the same object, a change made to L in place (such as L.append) also shows up in S
In [53]:
primes = [2,3,5,7,11,13,17,19]
S = primes
primes.append(23)
print(S)
[2, 3, 5, 7, 11, 13, 17, 19, 23]
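
One way to confirm that two names refer to a single list is the is operator, which tests object identity rather than equality. A minimal sketch, reusing the names from the cell above; the extra names (C, same_object, and so on) are only for illustration:

```python
primes = [2, 3, 5, 7, 11, 13, 17, 19]
S = primes                # a second name for the same list object
same_object = S is primes

C = list(primes)          # list(...) builds a separate list holding the same items
equal_values = C == primes
distinct_object = C is not primes

print(same_object, equal_values, distinct_object)  # True True True
```

Equality (==) compares contents, so a copy is equal to the original without being the same object.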

taking a slice is one way to force a copy

In [55]:
primes = [2,3,5,7,11,13,17,19]
S = primes[:]
primes.append(23)
print(S)
[2, 3, 5, 7, 11, 13, 17, 19]

so is using the .copy() method

In [59]:
primes = [2,3,5,7,11,13,17,19]
S = primes.copy()
primes.append(23)
print(S,primes)
[2, 3, 5, 7, 11, 13, 17, 19] [2, 3, 5, 7, 11, 13, 17, 19, 23]

lists and strings

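  • list(s) turns a string s into a list of its characters
  • join and split convert between a list of strings and a single delimited string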
In [11]:
','.join(['a','b','c'])
Out[11]:
'a,b,c'
In [12]:
list('ABCDEF')
Out[12]:
['A', 'B', 'C', 'D', 'E', 'F']
In [15]:
commas = ','.join(list('ABCDEF'))
print(commas)
A,B,C,D,E,F
In [16]:
commas.split(',')
Out[16]:
['A', 'B', 'C', 'D', 'E', 'F']
In [19]:
data = ".5,.2,1.8,17,21"
In [23]:
data.split(',')
Out[23]:
['.5', '.2', '1.8', '17', '21']
In [25]:
print('\t'.join(data.split(',')))
.5	.2	1.8	17	21
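
Note that split returns strings, not numbers. A minimal sketch of converting the pieces with float; it uses a list comprehension, which these notes have not introduced, and the names pieces and values are illustrative:

```python
data = ".5,.2,1.8,17,21"
pieces = data.split(',')               # still strings: ['.5', '.2', '1.8', '17', '21']
values = [float(p) for p in pieces]    # convert each piece to a float
print(values)  # [0.5, 0.2, 1.8, 17.0, 21.0]
```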

Iteration

  • iteration means doing an operation repeatedly with varying parameters
  • the for loop is the fundamental way to iterate in Python
  • INDENTATION MATTERS IN PYTHON SYNTAX
    • you can indent using tabs or spaces, but blocks must line up
    • notice the colon at the end of the first line
In [62]:
for x in [2,3,5,7,11,13,17,19]:
    print(x,'is a prime')
    print(x**2,'is its square')
2 is a prime
4 is its square
3 is a prime
9 is its square
5 is a prime
25 is its square
7 is a prime
49 is its square
11 is a prime
121 is its square
13 is a prime
169 is its square
17 is a prime
289 is its square
19 is a prime
361 is its square
In [64]:
for x in [2,3,5,7,11,13,17,19]:
  print(x,'is a prime')
   print(x**2,'is its square')
  File "<ipython-input-64-36ca985a007c>", line 3
    print(x**2,'is its square')
    ^
IndentationError: unexpected indent

iterating over a range, or over a list

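  • range(100): the integers 0 through 99
  • range(10,20): the integers 10 through 19
  • range(10,20,30): start, stop, step; a step of 30 overshoots the stop at once, so this yields only 10
  • np.arange(.1,1.1,.05): like range, but allows non-integer steps and returns a numpy array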
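
A quick way to see what a range contains is to wrap it in list. The sketch below uses small illustrative arguments (the variable names are hypothetical) and leaves out np.arange so that it runs without numpy:

```python
first_five = list(range(5))            # stop only: starts at 0, stops before 5
teens = list(range(10, 20))            # start and stop
by_threes = list(range(10, 20, 3))     # start, stop, step
big_step = list(range(10, 20, 30))     # step larger than the whole interval
countdown = list(range(10, 0, -1))     # negative step counts down

print(first_five)   # [0, 1, 2, 3, 4]
print(teens)        # [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
print(by_threes)    # [10, 13, 16, 19]
print(big_step)     # [10]
print(countdown)    # [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
```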
In [65]:
for x in range(10):
    print(x)
0
1
2
3
4
5
6
7
8
9
In [66]:
L=[]
for x in range(10):
    L.append(x)
print(L)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
In [68]:
for x in range(10,0,-1):
    print(x)
10
9
8
7
6
5
4
3
2
1
In [69]:
import numpy as np
In [70]:
for x in np.arange(1,5,.1):
    print(x,np.sin(x))
1.0 0.8414709848078965
1.1 0.8912073600614354
1.2000000000000002 0.9320390859672264
1.3000000000000003 0.9635581854171931
1.4000000000000004 0.9854497299884603
1.5000000000000004 0.9974949866040544
1.6000000000000005 0.9995736030415051
1.7000000000000006 0.9916648104524686
1.8000000000000007 0.973847630878195
1.9000000000000008 0.9463000876874142
2.000000000000001 0.9092974268256814
2.100000000000001 0.8632093666488733
2.200000000000001 0.8084964038195895
2.300000000000001 0.7457052121767194
2.4000000000000012 0.67546318055115
2.5000000000000013 0.5984721441039554
2.6000000000000014 0.515501371821463
2.7000000000000015 0.42737988023382856
2.8000000000000016 0.3349881501559034
2.9000000000000017 0.23924932921398068
3.0000000000000018 0.14112000805986546
3.100000000000002 0.041580662433288715
3.200000000000002 -0.058374143427581855
3.300000000000002 -0.1577456941432504
3.400000000000002 -0.2555411020268334
3.500000000000002 -0.35078322768962195
3.6000000000000023 -0.44252044329485446
3.7000000000000024 -0.5298361409084953
3.8000000000000025 -0.611857890942721
3.9000000000000026 -0.6877661591839757
4.000000000000003 -0.75680249530793
4.100000000000003 -0.8182771110644123
4.200000000000003 -0.8715757724135894
4.3000000000000025 -0.916165936749456
4.400000000000003 -0.9516020738895169
4.5000000000000036 -0.9775301176650978
4.600000000000003 -0.9936910036334649
4.700000000000003 -0.9999232575641009
4.800000000000003 -0.9961646088358403
4.900000000000004 -0.9824526126243318
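
The long decimals in the first column (1.2000000000000002 and so on) come from floating-point rounding, not from a mistake in the loop. A small sketch of the same effect in plain Python, with two common ways to cope: rounding for display, and math.isclose for comparisons.

```python
import math

total = 0.1 + 0.2
print(total)                      # 0.30000000000000004
print(total == 0.3)               # False: the two values differ in the last bits
print(math.isclose(total, 0.3))   # True: equal up to a small tolerance
print(f"{total:.1f}")             # 0.3, rounded only for display
```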

accumulating

In [74]:
# use the name total rather than sum, so the built-in sum function stays available
total = 0
for x in range(20):
    total = total + x
print(total)
190
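
Python also has a built-in sum function that performs this kind of accumulation in one call. A minimal sketch comparing the two; the accumulator is named total here so that it does not hide the built-in:

```python
total = 0
for x in range(20):
    total = total + x          # explicit accumulation, one item at a time

builtin_total = sum(range(20))  # the built-in does the same in one call

print(total, builtin_total)     # 190 190
```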
In [79]:
sentence = 'Now is the time for all good me to come to the aid of their party'
words = sentence.split()
total = 0   # named total so the built-in sum is not hidden
for x in words:
    print(x,len(x))
    total = total + len(x)
print('Total length is',total)
    
Now 3
is 2
the 3
time 4
for 3
all 3
good 4
me 2
to 2
come 4
to 2
the 3
aid 3
of 2
their 5
party 5
Total length is 50
In [86]:
# iterating over files
import pandas as pd
for file in ['../data/gapminder_gdp_africa.csv','../data/gapminder_gdp_asia.csv']:
    data = pd.read_csv(file,index_col='country')
    print(file, data['gdpPercap_1957'].idxmin())
../data/gapminder_gdp_africa.csv Lesotho
../data/gapminder_gdp_asia.csv Myanmar

reading a directory

glob finds files whose names match UNIX-style wildcard patterns (such as *)

In [87]:
from glob import glob
In [105]:
for file in glob('../data/*gdp*.csv'):
    data = pd.read_csv(file,index_col='country')
    print(file, data['gdpPercap_1957'].idxmin())
../data/gapminder_gdp_americas.csv Dominican Republic
../data/gapminder_gdp_europe.csv Bosnia and Herzegovina
../data/gapminder_gdp_oceania.csv Australia
../data/gapminder_gdp_africa.csv Lesotho
../data/gapminder_gdp_asia.csv Myanmar
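
glob returns its matches in no guaranteed order, which is why the listing above is not alphabetical. If the order matters, one option is to wrap the call in sorted. The sketch below creates a few empty files with made-up names in a temporary folder so that it can run anywhere:

```python
import os
import tempfile
from glob import glob

with tempfile.TemporaryDirectory() as folder:
    # hypothetical file names, created only for this demonstration
    for name in ['gdp_b.csv', 'gdp_a.csv', 'notes.txt']:
        open(os.path.join(folder, name), 'w').close()
    # sorted() puts the matching paths in alphabetical order
    matches = sorted(glob(os.path.join(folder, '*gdp*.csv')))
    names = [os.path.basename(m) for m in matches]

print(names)  # ['gdp_a.csv', 'gdp_b.csv']
```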
In [106]:
for file in glob('../data/*gdp*.csv'):
    data = pd.read_csv(file,index_col='country')
    continent = file.split('_')[2].split('.')[0].upper()
    print(continent, '\nPoorest in 1957:',data['gdpPercap_1957'].idxmin(),'\nRichest in 1957',data['gdpPercap_1957'].idxmax(),'\n\n')
AMERICAS 
Poorest in 1957: Dominican Republic 
Richest in 1957 United States 


EUROPE 
Poorest in 1957: Bosnia and Herzegovina 
Richest in 1957 Switzerland 


OCEANIA 
Poorest in 1957: Australia 
Richest in 1957 New Zealand 


AFRICA 
Poorest in 1957: Lesotho 
Richest in 1957 South Africa 


ASIA 
Poorest in 1957: Myanmar 
Richest in 1957 Kuwait 


nested loops

In [107]:
for fruit in ['apple','pear','grape']:
    for color in ['green','orange','yellow']:
        print('I wish I had a ',color,fruit) 
I wish I had a  green apple
I wish I had a  orange apple
I wish I had a  yellow apple
I wish I had a  green pear
I wish I had a  orange pear
I wish I had a  yellow pear
I wish I had a  green grape
I wish I had a  orange grape
I wish I had a  yellow grape

putting some stuff together:

- string operations to extract the continent from the file name
- making a column name
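
The two string steps can be tried on a single file name without reading any data. A minimal sketch; the intermediate names parts and last are only for illustration:

```python
file = '../data/gapminder_gdp_africa.csv'

parts = file.split('_')                 # ['../data/gapminder', 'gdp', 'africa.csv']
last = parts[2]                         # 'africa.csv'
continent = last.split('.')[0].upper()  # drop the extension, then upper-case

year = '1957'
key = 'gdpPercap_' + year               # column name built by string concatenation

print(continent, key)  # AFRICA gdpPercap_1957
```

Note that this relies on the path containing exactly two underscores before the continent; a folder name with an underscore in it would shift the index.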
In [104]:
for file in glob('../data/*gdp*.csv'):
    # read each file once, then reuse the dataframe for both years
    data = pd.read_csv(file,index_col='country')
    continent = file.split('_')[2].split('.')[0].upper()
    for year in ['1957','2002']:
        key = 'gdpPercap_'+year
        print(continent, '\nPoorest in '+year+':',data[key].idxmin(),'\nRichest in '+year+':',data[key].idxmax(),'\n\n')
AMERICAS 
Poorest in 1957: Dominican Republic 
Richest in 1957: United States 


AMERICAS 
Poorest in 2002: Haiti 
Richest in 2002: United States 


EUROPE 
Poorest in 1957: Bosnia and Herzegovina 
Richest in 1957: Switzerland 


EUROPE 
Poorest in 2002: Albania 
Richest in 2002: Norway 


OCEANIA 
Poorest in 1957: Australia 
Richest in 1957: New Zealand 


OCEANIA 
Poorest in 2002: New Zealand 
Richest in 2002: Australia 


AFRICA 
Poorest in 1957: Lesotho 
Richest in 1957: South Africa 


AFRICA 
Poorest in 2002: Congo Dem. Rep. 
Richest in 2002: Gabon 


ASIA 
Poorest in 1957: Myanmar 
Richest in 1957: Kuwait 


ASIA 
Poorest in 2002: Myanmar 
Richest in 2002: Singapore 


In [111]:
# note: "import glob" makes the name glob refer to the module, so the function is glob.glob here
import glob
import pandas as pd
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1,1)
for filename in glob.glob('../data/gapminder_gdp*.csv'):
    # index by country so that mean() only sees the numeric GDP columns
    dataframe = pd.read_csv(filename, index_col='country')
    # extract <region> from the filename, expected to be in the format '../data/gapminder_gdp_<region>.csv'.
    # we will split the string using the split method and `_` as our separator,
    # retrieve the last string in the list that split returns (`<region>.csv`),
    # and then remove the `.csv` extension from that string.
    region = filename.split('_')[-1][:-4]
    dataframe.mean().plot(ax=ax, label=region)
plt.legend()
plt.show()

Functions

We have seen many examples of built-in and library functions (print, the math functions, read_csv, and so on). You can also define your own function, which takes certain inputs (arguments) and can return a value.

In [113]:
def print_date(year, month, day):
    joined = str(year)+'/'+str(month)+'/'+str(day)
    print(joined)
In [114]:
print_date(2007,11,31)
2007/11/31

You can return the value for use in other computations

In [116]:
def format_date(year,month,day):
    joined = str(year)+'/'+str(month)+'/'+str(day)
    return joined
In [117]:
format_date(2007,12,10)
Out[117]:
'2007/12/10'
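
Because format_date returns a string, small variations are easy to write. The sketch below is a hypothetical variant (format_date_padded is not part of these notes) that uses an f-string to pad the month and day to two digits:

```python
def format_date_padded(year, month, day):
    # :02d formats an integer with at least two digits, padding with a zero
    return f"{year}/{month:02d}/{day:02d}"

padded = format_date_padded(2007, 5, 9)
unchanged = format_date_padded(2007, 12, 10)
print(padded, unchanged)  # 2007/05/09 2007/12/10
```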
In [118]:
dates = []
for year in range(2007,2010):
    for month in range(5, 8):
        for day in range(12,20):
            dates.append(format_date(year,month,day))
            
In [119]:
dates
Out[119]:
['2007/5/12',
 '2007/5/13',
 '2007/5/14',
 '2007/5/15',
 '2007/5/16',
 '2007/5/17',
 '2007/5/18',
 '2007/5/19',
 '2007/6/12',
 '2007/6/13',
 '2007/6/14',
 '2007/6/15',
 '2007/6/16',
 '2007/6/17',
 '2007/6/18',
 '2007/6/19',
 '2007/7/12',
 '2007/7/13',
 '2007/7/14',
 '2007/7/15',
 '2007/7/16',
 '2007/7/17',
 '2007/7/18',
 '2007/7/19',
 '2008/5/12',
 '2008/5/13',
 '2008/5/14',
 '2008/5/15',
 '2008/5/16',
 '2008/5/17',
 '2008/5/18',
 '2008/5/19',
 '2008/6/12',
 '2008/6/13',
 '2008/6/14',
 '2008/6/15',
 '2008/6/16',
 '2008/6/17',
 '2008/6/18',
 '2008/6/19',
 '2008/7/12',
 '2008/7/13',
 '2008/7/14',
 '2008/7/15',
 '2008/7/16',
 '2008/7/17',
 '2008/7/18',
 '2008/7/19',
 '2009/5/12',
 '2009/5/13',
 '2009/5/14',
 '2009/5/15',
 '2009/5/16',
 '2009/5/17',
 '2009/5/18',
 '2009/5/19',
 '2009/6/12',
 '2009/6/13',
 '2009/6/14',
 '2009/6/15',
 '2009/6/16',
 '2009/6/17',
 '2009/6/18',
 '2009/6/19',
 '2009/7/12',
 '2009/7/13',
 '2009/7/14',
 '2009/7/15',
 '2009/7/16',
 '2009/7/17',
 '2009/7/18',
 '2009/7/19']

functions and data analysis

Recall that we worked with the file gapminder_data.csv, loaded as a pandas dataframe.

In [130]:
import matplotlib.pyplot as plt
plt.style.use('ggplot')
In [131]:
data = pd.read_csv('../data/gapminder_data.csv',index_col='country')
In [132]:
percapita = pd.pivot_table(data,index='country',columns='year',values='gdpPercap')
In [134]:
percapita.loc['Afghanistan'].plot(title='Afghanistan GDP Per capita over Time')
Out[134]:
<matplotlib.axes._subplots.AxesSubplot at 0x1174d7940>
In [151]:
def country_gdp(country):
    # successive calls draw on the same axes, so the title stays generic;
    # legend=True labels each line with the country name
    percapita.loc[country].plot(figsize=(8,8),legend=True,title='GDP Per Capita')
In [152]:
country_gdp("United States")
country_gdp("Germany")
country_gdp("Switzerland")
In [153]:
for country in ['United States','China','Germany','Australia','Brazil']:
    country_gdp(country)

A random walk

In [154]:
import numpy.random as rnd
In [155]:
x = rnd.choice([-1,1])
In [156]:
def random_walk(N):
    spot = 0
    L = []
    for i in range(N):
        spot = spot + rnd.choice([-1,1])
        L.append(spot)
    return L
In [162]:
end_spots = []
for i in range(200):
    walk = random_walk(100)
    plt.plot(range(100),walk)
    end_spots.append(walk[-1])

    
In [166]:
s=plt.hist(end_spots,bins=10)
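
Each run of the cells above gives different walks. If repeatable results are wanted, one option is to seed the random number generator. The sketch below is a variation rather than the code used above: it swaps in the standard-library random module in place of numpy.random, and the name random_walk_seeded and its seed parameter are hypothetical.

```python
import random

def random_walk_seeded(N, seed):
    rng = random.Random(seed)   # a private generator with a fixed seed
    spot = 0
    L = []
    for i in range(N):
        spot = spot + rng.choice([-1, 1])   # step left or right
        L.append(spot)
    return L

walk_a = random_walk_seeded(100, seed=1)
walk_b = random_walk_seeded(100, seed=1)
print(walk_a == walk_b)  # True: the same seed reproduces the same walk
```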

Variable scope

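Variables assigned inside a function are "local" to that function: they do not exist once the function returns, and assigning to them does not change a variable of the same name outside.

  • the scope of a variable is the region of code where its name can be used; for a local variable, that is the body of the function and anything nested inside it.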
In [174]:
def lister(x):
    L = [x]*10
    v = 47
    print(L)
In [173]:
print(L)
lister(4)
print(L)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[4, 4, 4, 4, 4, 4, 4, 4, 4, 4]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

There are subtleties to variable scope. If a name used inside a function is neither an argument nor assigned within the function, Python looks it up "outside" the function.

In [186]:
L = ['a','b','c']
def j(x):
    L.append(x)
In [187]:
j('u')
In [188]:
L
Out[188]:
['a', 'b', 'c', 'u']

And mutable structures (like lists and arrays) CAN be modified in place inside a function EVEN IF they are passed in as an argument; the caller sees the change.

In [193]:
L = ['a','b','c']
def j(x,L):
    L.append(x)
j('x',L)
print(L)
['a', 'b', 'c', 'x']
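
The difference that matters here is between changing a list in place and assigning a new value to its name. A minimal sketch with two hypothetical functions, rebind and mutate:

```python
def rebind(L):
    L = ['new']        # assignment makes L refer to a new local list
    return L

def mutate(L):
    L.append('new')    # append changes the caller's list in place

letters = ['a', 'b', 'c']

rebind(letters)
after_rebind = list(letters)   # snapshot: the caller's list is unchanged

mutate(letters)
after_mutate = list(letters)   # snapshot: the caller's list has grown

print(after_rebind)  # ['a', 'b', 'c']
print(after_mutate)  # ['a', 'b', 'c', 'new']
```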

The full set of rules is a bit complicated and we won't go into all the details.

Conditionals

In [194]:
def my_abs(x):
    if x<0:
        return -x
    else:
        return x
In [195]:
my_abs(-1)
Out[195]:
1

The next function returns a "tuple" containing a pair of lists. You can get the two lists with subscripts or with the A, B = construction.

In [196]:
def split_threshold(threshold,L):
    Low = []
    High = []
    for item in L:
        if item<threshold:
            Low.append(item)
        else:
            High.append(item)
    return Low, High
        
        
        
In [203]:
T = split_threshold(0,[-1,-5,2,-13,11,100])
print(T)
([-1, -5, -13], [2, 11, 100])
In [204]:
print(T[0])
[-1, -5, -13]
In [205]:
print(T[1])
[2, 11, 100]

This is the syntax for unpacking a tuple

In [198]:
Low, High = split_threshold(0,[-1,-5,2,-13,11,100])
In [206]:
Low
Out[206]:
[-1, -5, -13]
In [207]:
High
Out[207]:
[2, 11, 100]

Python has the logical operators "and" and "or"

In [209]:
data['pop'].head()
Out[209]:
country
Afghanistan     8425333.0
Afghanistan     9240934.0
Afghanistan    10267083.0
Afghanistan    11537966.0
Afghanistan    13079460.0
Name: pop, dtype: float64
In [217]:
for x in 'Jeremy Teitelbaum':
    if (x>='r' and x<='u'):
        print(x)
r
t
u
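
The cell above only uses and. A short sketch of or on the same string, together with the chained comparison that Python accepts as shorthand for the and test; the list names are illustrative:

```python
name = 'Jeremy Teitelbaum'

in_range = []
e_or_a = []
for x in name:
    if 'r' <= x <= 'u':          # same test as (x>='r' and x<='u')
        in_range.append(x)
    if x == 'e' or x == 'a':     # true when either condition holds
        e_or_a.append(x)

print(in_range)  # ['r', 't', 'u']
print(e_or_a)    # ['e', 'e', 'e', 'e', 'a']
```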

applying a function in pandas
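
The cell that defines population_size is not shown in these notes. The sketch below is one possible definition that is consistent with the outputs that follow; the cutoffs of 10 million and 1 billion are assumptions inferred from those outputs, not the original values.

```python
def population_size(pop):
    """Classify a population as 'small', 'medium' or 'large'."""
    # ASSUMED thresholds, chosen only to match the outputs shown in this section
    if pop < 10_000_000:
        return 'small'
    elif pop < 1_000_000_000:
        return 'medium'
    else:
        return 'large'

print(population_size(300000000))  # medium
```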

In [223]:
population_size(300000000)
Out[223]:
'medium'
In [227]:
data['pop_class']=data['pop'].apply(population_size)
In [242]:
data.head()
Out[242]:
             year         pop continent  lifeExp   gdpPercap pop_class
country
Afghanistan  1952   8425333.0      Asia   28.801  779.445314     small
Afghanistan  1957   9240934.0      Asia   30.332  820.853030     small
Afghanistan  1962  10267083.0      Asia   31.997  853.100710    medium
Afghanistan  1967  11537966.0      Asia   34.020  836.197138    medium
Afghanistan  1972  13079460.0      Asia   36.088  739.981106    medium
In [241]:
data[(data['year']==2002)].groupby('pop_class')['lifeExp'].mean().plot(kind='bar')
Out[241]:
<matplotlib.axes._subplots.AxesSubplot at 0x1190232e8>
In [243]:
def life_exp_by_class_and_year(year):
    # filter on the year passed in, rather than a hard-coded 2002
    data[(data['year']==year)].groupby('pop_class')['lifeExp'].mean().plot(kind='bar')
In [245]:
life_exp_by_class_and_year(1957)
In [246]:
data[(data['year']==2002) & (data['pop_class'] == 'large')]
Out[246]:
         year           pop continent  lifeExp    gdpPercap pop_class
country
China    2002  1.280400e+09      Asia   72.028  3119.280896     large
India    2002  1.034173e+09      Asia   62.879  1746.769454     large

docstrings

In [254]:
def threshold(x,L):
    """Returns (Low, High) where Low is the list of elements in L less than x, 
    and High is a list of those greater than or equal to x"""
    Low, High = [], []
    for item in L:
        if item<x:
            Low.append(item)
        else:
            High.append(item)
    return Low, High
In [255]:
?threshold
Signature: threshold(x, L)
Docstring:
Returns (Low, High) where Low is the list of elements in L less than x, 
and High is a list of those greater than or equal to x
File:      ~/GitHub/Carpentries/2020-01-13/notes/<ipython-input-254-5a9844fcc62c>
Type:      function

default arguments

In [256]:
def threshold(L,x=0):
    """Returns (Low, High) where Low is the list of elements in L less than x, 
    and High is a list of those greater than or equal to x.  x defaults to zero."""
    Low, High = [], []
    for item in L:
        if item<x:
            Low.append(item)
        else:
            High.append(item)
    return Low, High
In [257]:
threshold([1,-3,2,5])
Out[257]:
([-3], [1, 2, 5])
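
A default argument can still be overridden, either by position or by name. A minimal sketch that repeats the definition above so it runs on its own; the result names are illustrative:

```python
def threshold(L, x=0):
    """Returns (Low, High): elements of L below x, and those at or above x."""
    Low, High = [], []
    for item in L:
        if item < x:
            Low.append(item)
        else:
            High.append(item)
    return Low, High

default_split = threshold([1, -3, 2, 5])       # x falls back to 0
positional = threshold([1, -3, 2, 5], 2)       # x given by position
keyword = threshold([1, -3, 2, 5], x=2)        # x given by name

print(default_split)  # ([-3], [1, 2, 5])
print(positional)     # ([1, -3], [2, 5])
print(keyword)        # ([1, -3], [2, 5])
```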
In [258]:
?threshold
Signature: threshold(L, x=0)
Docstring:
Returns (Low, High) where Low is the list of elements in L less than x, 
and High is a list of those greater than or equal to x.  x defaults to zero.
File:      ~/GitHub/Carpentries/2020-01-13/notes/<ipython-input-256-bc8a5db254e7>
Type:      function
In [259]:
?print
Docstring:
print(value, ..., sep=' ', end='\n', file=sys.stdout, flush=False)

Prints the values to a stream, or to sys.stdout by default.
Optional keyword arguments:
file:  a file-like object (stream); defaults to the current sys.stdout.
sep:   string inserted between values, default a space.
end:   string appended after the last value, default a newline.
flush: whether to forcibly flush the stream.
Type:      builtin_function_or_method