methods vs functions¶

• methods (and the del statement) act on a list in place; the methods return None
    • append: L.append(x)
    • del: del L[3]
    • sort: L.sort()
• functions and operators create a new list, leaving the original one alone
    • concatenation: S = L + [x]
    • sorted: Y = sorted(L)

In [32]:
primes = [2,3,5,7,11,13,17,19]
new = primes.append(23)
print('returned value = ',new,'original = ',primes)

returned value =  None original =  [2, 3, 5, 7, 11, 13, 17, 19, 23]

In [31]:
primes = [2,3,5,7,11,13,17,19]
new = primes + [23]
print('returned value = ',new,'original = ',primes)

returned value =  [2, 3, 5, 7, 11, 13, 17, 19, 23] original =  [2, 3, 5, 7, 11, 13, 17, 19]

In [4]:
del primes[3]

In [5]:
primes

Out[5]:
[2, 3, 5, 11, 13, 17, 19, 23]
In [41]:
primes = [2,3,5,7,11,13,17,19]
new = primes.sort(reverse=True)
print('returned value = ',new,'original = ',primes)

returned value =  None original =  [19, 17, 13, 11, 7, 5, 3, 2]

In [38]:
primes = [2,3,5,7,11,13,17,19]
new= sorted(primes,reverse=True)
print('returned value = ',new,'original = ',primes)

returned value =  [19, 17, 13, 11, 7, 5, 3, 2] original =  [2, 3, 5, 7, 11, 13, 17, 19]


copying¶

• if L is a list, then setting S=L creates a new NAME (S) for L, but not a new object
• if you then change L in place (for example with L.append), the change shows up in S too, because both names refer to the same list
In [53]:
primes = [2,3,5,7,11,13,17,19]
S = primes
primes.append(23)
print(S)

[2, 3, 5, 7, 11, 13, 17, 19, 23]


taking a slice is one way to force a copy

In [55]:
primes = [2,3,5,7,11,13,17,19]
S = primes[:]
primes.append(23)
print(S)

[2, 3, 5, 7, 11, 13, 17, 19]


so is using the .copy() method

In [59]:
primes = [2,3,5,7,11,13,17,19]
S = primes.copy()
primes.append(23)
print(S,primes)

[2, 3, 5, 7, 11, 13, 17, 19] [2, 3, 5, 7, 11, 13, 17, 19, 23]
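One way to check whether two names refer to the same object is the `is` operator. The following is a minimal sketch; the names `alias`, `sliced`, and `copied` are made up for illustration.

```python
primes = [2, 3, 5, 7, 11, 13, 17, 19]

alias = primes           # a second name for the same list object
sliced = primes[:]       # a new list holding the same elements
copied = primes.copy()   # also a new list

print(alias is primes)   # True: both names point at one object
print(sliced is primes)  # False: the slice is a separate object
print(copied is primes)  # False: so is the copy
print(sliced == primes)  # True: the contents are still equal
```

Note that `==` compares contents while `is` compares identity, which is why the slice is equal to `primes` without being the same list.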


lists and strings¶

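• list(s) turns a string s into a list of its characters
• sep.join(L) glues a list of strings into one string; s.split(sep) cuts a string into a list of pieces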
In [11]:
','.join(['a','b','c'])

Out[11]:
'a,b,c'
In [12]:
list('ABCDEF')

Out[12]:
['A', 'B', 'C', 'D', 'E', 'F']
In [15]:
commas = ','.join(list('ABCDEF'))
print(commas)

A,B,C,D,E,F

In [16]:
commas.split(',')

Out[16]:
['A', 'B', 'C', 'D', 'E', 'F']
In [19]:
data = ".5,.2,1.8,17,21"

In [23]:
data.split(',')

Out[23]:
['.5', '.2', '1.8', '17', '21']
In [25]:
print('\t'.join(data.split(',')))

.5	.2	1.8	17	21
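Note that split returns strings, not numbers. A minimal sketch of converting the pieces to floats before doing arithmetic, assuming every piece is a valid number (the names `values` and `total` are illustrative):

```python
data = ".5,.2,1.8,17,21"

# float() converts each piece of text to a number
values = [float(piece) for piece in data.split(',')]
total = sum(values)

print(values)
print(total)
```

If a piece cannot be parsed (for example an empty string from a trailing comma), `float` raises a `ValueError`.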


Iteration¶

• iteration means doing an operation repeatedly with varying parameters
• the for statement is the basic way to iterate in Python
• INDENTATION MATTERS IN PYTHON SYNTAX
• you can indent using tabs or spaces (four spaces is the usual convention), but every line in a block must be indented the same way
• notice the colon at the end of the first line
In [62]:
for x in [2,3,5,7,11,13,17,19]:
    print(x,'is a prime')
    print(x**2,'is its square')

2 is a prime
4 is its square
3 is a prime
9 is its square
5 is a prime
25 is its square
7 is a prime
49 is its square
11 is a prime
121 is its square
13 is a prime
169 is its square
17 is a prime
289 is its square
19 is a prime
361 is its square

In [64]:
for x in [2,3,5,7,11,13,17,19]:
    print(x,'is a prime')
      print(x**2,'is its square')

  File "<ipython-input-64-36ca985a007c>", line 3
    print(x**2,'is its square')
    ^
IndentationError: unexpected indent


iterating over a range, or over a list¶

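• range(100): the integers 0, 1, ..., 99
• range(10,20): the integers 10, 11, ..., 19 (the stop value is excluded)
• range(10,20,30): the third argument is the step; with a step of 30 only 10 is produced
• np.arange(.1,1.1,.05): the NumPy version, which also accepts non-integer steps and returns an array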
In [65]:
for x in range(10):
    print(x)

0
1
2
3
4
5
6
7
8
9

In [66]:
L=[]
for x in range(10):
    L.append(x)
print(L)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [68]:
for x in range(10,0,-1):
    print(x)

10
9
8
7
6
5
4
3
2
1
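Wrapping a range in list() is a quick way to see which values it produces. A small sketch covering the one-, two-, and three-argument forms (the variable names are illustrative):

```python
upto = list(range(5))               # stop only: starts at 0
between = list(range(10, 20))       # start and stop; stop is excluded
stepped = list(range(10, 20, 3))    # start, stop, step
too_big = list(range(10, 20, 30))   # the step overshoots, so only the start appears
countdown = list(range(10, 0, -1))  # a negative step counts down

print(upto)
print(between)
print(stepped)
print(too_big)
print(countdown)
```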

In [69]:
import numpy as np

In [70]:
for x in np.arange(1,5,.1):
    print(x,np.sin(x))

1.0 0.8414709848078965
1.1 0.8912073600614354
1.2000000000000002 0.9320390859672264
1.3000000000000003 0.9635581854171931
1.4000000000000004 0.9854497299884603
1.5000000000000004 0.9974949866040544
1.6000000000000005 0.9995736030415051
1.7000000000000006 0.9916648104524686
1.8000000000000007 0.973847630878195
1.9000000000000008 0.9463000876874142
2.000000000000001 0.9092974268256814
2.100000000000001 0.8632093666488733
2.200000000000001 0.8084964038195895
2.300000000000001 0.7457052121767194
2.4000000000000012 0.67546318055115
2.5000000000000013 0.5984721441039554
2.6000000000000014 0.515501371821463
2.7000000000000015 0.42737988023382856
2.8000000000000016 0.3349881501559034
2.9000000000000017 0.23924932921398068
3.0000000000000018 0.14112000805986546
3.100000000000002 0.041580662433288715
3.200000000000002 -0.058374143427581855
3.300000000000002 -0.1577456941432504
3.400000000000002 -0.2555411020268334
3.500000000000002 -0.35078322768962195
3.6000000000000023 -0.44252044329485446
3.7000000000000024 -0.5298361409084953
3.8000000000000025 -0.611857890942721
3.9000000000000026 -0.6877661591839757
4.000000000000003 -0.75680249530793
4.100000000000003 -0.8182771110644123
4.200000000000003 -0.8715757724135894
4.3000000000000025 -0.916165936749456
4.400000000000003 -0.9516020738895169
4.5000000000000036 -0.9775301176650978
4.600000000000003 -0.9936910036334649
4.700000000000003 -0.9999232575641009
4.800000000000003 -0.9961646088358403
4.900000000000004 -0.9824526126243318
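The long decimals above come from floating-point rounding, and with a non-integer step it can be hard to predict whether the last value is included. When the number of points matters more than the step, np.linspace is a common alternative. This is a sketch rather than a drop-in replacement, and the names `by_step` and `by_count` are illustrative:

```python
import numpy as np

by_step = np.arange(1, 5, .1)        # choose the step; the count follows from it
by_count = np.linspace(1, 4.9, 40)   # choose the count; both endpoints are included

print(len(by_step), by_step[0], by_step[-1])
print(len(by_count), by_count[0], by_count[-1])
```

The values in between can still show small rounding differences; linspace only guarantees the number of points and the two endpoints.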


accumulating¶

In [74]:
total = 0   # named total because sum would hide Python's built-in sum() function
for x in range(20):
    total = total + x
print(total)

190

In [79]:
sentence = 'Now is the time for all good me to come to the aid of their party'
words = sentence.split()
total = 0   # named total because sum would hide Python's built-in sum() function
for x in words:
    print(x,len(x))
    total = total + len(x)
print('Total length is',total)


Now 3
is 2
the 3
time 4
for 3
all 3
good 4
me 2
to 2
come 4
to 2
the 3
aid 3
of 2
their 5
party 5
Total length is 50
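The accumulate-in-a-loop pattern is common enough that Python has a built-in `sum` function for it. A minimal sketch of both examples above written that way, assuming the name `sum` has not been reassigned earlier in the session (the names `total_to_19` and `total_length` are illustrative):

```python
# add up 0 + 1 + ... + 19
total_to_19 = sum(range(20))

# add up the lengths of the words in the sentence
sentence = 'Now is the time for all good me to come to the aid of their party'
words = sentence.split()
total_length = sum(len(word) for word in words)

print(total_to_19)
print('Total length is', total_length)
```

The explicit loop is still the right tool when each step does more than add a number, such as printing every word along the way.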

In [86]:
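# iterating over files
import pandas as pd
for file in ['../data/gapminder_gdp_africa.csv','../data/gapminder_gdp_asia.csv']:
    # load each file, indexing rows by country so idxmin() returns a country name
    data = pd.read_csv(file, index_col='country')
    print(file, data['gdpPercap_1957'].idxmin())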

../data/gapminder_gdp_africa.csv Lesotho
../data/gapminder_gdp_asia.csv Myanmar


In [87]:
from glob import glob

In [105]:
for file in glob('../data/*gdp*.csv'):
    # load each matching file before querying it
    data = pd.read_csv(file, index_col='country')
    print(file, data['gdpPercap_1957'].idxmin())

../data/gapminder_gdp_americas.csv Dominican Republic
../data/gapminder_gdp_europe.csv Bosnia and Herzegovina
../data/gapminder_gdp_oceania.csv Australia
../data/gapminder_gdp_africa.csv Lesotho
../data/gapminder_gdp_asia.csv Myanmar

In [106]:
for file in glob('../data/*gdp*.csv'):
    # load each matching file before querying it
    data = pd.read_csv(file, index_col='country')
    continent = file.split('_')[2].split('.')[0].upper()
    print(continent, '\nPoorest in 1957:',data['gdpPercap_1957'].idxmin(),'\nRichest in 1957',data['gdpPercap_1957'].idxmax(),'\n\n')

AMERICAS
Poorest in 1957: Dominican Republic
Richest in 1957 United States

EUROPE
Poorest in 1957: Bosnia and Herzegovina
Richest in 1957 Switzerland

OCEANIA
Poorest in 1957: Australia
Richest in 1957 New Zealand

AFRICA
Poorest in 1957: Lesotho
Richest in 1957 South Africa

ASIA
Poorest in 1957: Myanmar
Richest in 1957 Kuwait



nested loops¶

In [107]:
for fruit in ['apple','pear','grape']:
    for color in ['green','orange','yellow']:
        print('I wish I had a ',color,fruit)

I wish I had a  green apple
I wish I had a  orange apple
I wish I had a  yellow apple
I wish I had a  green pear
I wish I had a  orange pear
I wish I had a  yellow pear
I wish I had a  green grape
I wish I had a  orange grape
I wish I had a  yellow grape


putting some stuff together:¶

- string operations to extract the continent from the file name
- making a column name
In [104]:
for file in glob('../data/*gdp*.csv'):
    # load the file once, then report on each year
    data = pd.read_csv(file, index_col='country')
    for year in ['1957','2002']:
        continent = file.split('_')[2].split('.')[0].upper()
        key = 'gdpPercap_'+year
        print(continent, '\nPoorest in '+year+':',data[key].idxmin(),'\nRichest in '+year+':',data[key].idxmax(),'\n\n')

AMERICAS
Poorest in 1957: Dominican Republic
Richest in 1957: United States

AMERICAS
Poorest in 2002: Haiti
Richest in 2002: United States

EUROPE
Poorest in 1957: Bosnia and Herzegovina
Richest in 1957: Switzerland

EUROPE
Poorest in 2002: Albania
Richest in 2002: Norway

OCEANIA
Poorest in 1957: Australia
Richest in 1957: New Zealand

OCEANIA
Poorest in 2002: New Zealand
Richest in 2002: Australia

AFRICA
Poorest in 1957: Lesotho
Richest in 1957: South Africa

AFRICA
Poorest in 2002: Congo Dem. Rep.
Richest in 2002: Gabon

ASIA
Poorest in 1957: Myanmar
Richest in 1957: Kuwait

ASIA
Poorest in 2002: Myanmar
Richest in 2002: Singapore


In [111]:
import glob   # note: this rebinds the name glob to the module, so call glob.glob(...)
import pandas as pd
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1,1)
for filename in glob.glob('../data/gapminder_gdp*.csv'):
    # load the file; indexing by country keeps only numeric columns for mean()
    dataframe = pd.read_csv(filename, index_col='country')
    # extract <region> from the filename, expected to be in the format 'data/gapminder_gdp_<region>.csv'.
    # we will split the string using the split method and _ as our separator,
    # retrieve the last string in the list that split returns (<region>.csv),
    # and then remove the .csv extension from that string.
    region = filename.split('_')[-1][:-4]
    dataframe.mean().plot(ax=ax, label=region)
plt.legend()
plt.show()


Functions¶

We have seen many examples of existing functions (print, the math functions, and read_csv are all functions). You can also write your own: a function takes certain inputs (its arguments) and can return a value.

In [113]:
def print_date(year, month, day):
    joined = str(year)+'/'+str(month)+'/'+str(day)
    print(joined)

In [114]:
print_date(2007,11,31)

2007/11/31


You can return the value for use in other computations¶

In [116]:
def format_date(year,month,day):
    joined = str(year)+'/'+str(month)+'/'+str(day)
    return joined

In [117]:
format_date(2007,12,10)

Out[117]:
'2007/12/10'
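The difference between printing and returning shows up as soon as you try to use the result. A short sketch with both functions redefined so it runs on its own (the names `printed` and `returned` are illustrative):

```python
def print_date(year, month, day):
    joined = str(year) + '/' + str(month) + '/' + str(day)
    print(joined)          # shows the text, but hands nothing back

def format_date(year, month, day):
    joined = str(year) + '/' + str(month) + '/' + str(day)
    return joined          # hands the text back to the caller

printed = print_date(2007, 12, 10)     # prints 2007/12/10 as a side effect
returned = format_date(2007, 12, 10)   # prints nothing

print(printed)               # None: there is nothing to work with
print(returned.split('/'))   # the returned string can be processed further
```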
In [118]:
dates = []
for year in range(2007,2010):
    for month in range(5, 8):
        for day in range(12,20):
            dates.append(format_date(year,month,day))


In [119]:
dates

Out[119]:
['2007/5/12',
'2007/5/13',
'2007/5/14',
'2007/5/15',
'2007/5/16',
'2007/5/17',
'2007/5/18',
'2007/5/19',
'2007/6/12',
'2007/6/13',
'2007/6/14',
'2007/6/15',
'2007/6/16',
'2007/6/17',
'2007/6/18',
'2007/6/19',
'2007/7/12',
'2007/7/13',
'2007/7/14',
'2007/7/15',
'2007/7/16',
'2007/7/17',
'2007/7/18',
'2007/7/19',
'2008/5/12',
'2008/5/13',
'2008/5/14',
'2008/5/15',
'2008/5/16',
'2008/5/17',
'2008/5/18',
'2008/5/19',
'2008/6/12',
'2008/6/13',
'2008/6/14',
'2008/6/15',
'2008/6/16',
'2008/6/17',
'2008/6/18',
'2008/6/19',
'2008/7/12',
'2008/7/13',
'2008/7/14',
'2008/7/15',
'2008/7/16',
'2008/7/17',
'2008/7/18',
'2008/7/19',
'2009/5/12',
'2009/5/13',
'2009/5/14',
'2009/5/15',
'2009/5/16',
'2009/5/17',
'2009/5/18',
'2009/5/19',
'2009/6/12',
'2009/6/13',
'2009/6/14',
'2009/6/15',
'2009/6/16',
'2009/6/17',
'2009/6/18',
'2009/6/19',
'2009/7/12',
'2009/7/13',
'2009/7/14',
'2009/7/15',
'2009/7/16',
'2009/7/17',
'2009/7/18',
'2009/7/19']

functions and data analysis¶

Recall that we worked with the gapminder data in gapminder_data.csv, loaded as a pandas DataFrame.

In [130]:
import matplotlib.pyplot as plt
plt.style.use('ggplot')

In [131]:
data = pd.read_csv('../data/gapminder_data.csv',index_col='country')

In [132]:
percapita = pd.pivot_table(data,index='country',columns='year',values='gdpPercap')

In [134]:
percapita.loc['Afghanistan'].plot(title='Afghanistan GDP Per capita over Time')

Out[134]:
<matplotlib.axes._subplots.AxesSubplot at 0x1174d7940>
In [151]:
def country_gdp(country):
    # plot one country's GDP per capita; repeated calls draw on the same axes,
    # so the title is generic and the legend identifies each country
    percapita.loc[country].plot(figsize=(8,8),legend=True,title='GDP Per Capita')

In [152]:
country_gdp("United States")
country_gdp("Germany")
country_gdp("Switzerland")

In [153]:
for country in ['United States','China','Germany','Australia','Brazil']:
    country_gdp(country)


A random walk¶

In [154]:
import numpy.random as rnd

In [155]:
x = rnd.choice([-1,1])

In [156]:
def random_walk(N):
    spot = 0
    L = []
    for i in range(N):
        spot = spot + rnd.choice([-1,1])
        L.append(spot)
    return L

In [162]:
end_spots = []
for i in range(200):
    walk = random_walk(100)
    plt.plot(range(100),walk)
    end_spots.append(walk[-1])


In [166]:
s=plt.hist(end_spots,bins=10)
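The loop in random_walk can also be written without an explicit for, by drawing all the steps at once and taking a running total. The following is a sketch of that alternative, not the version used above; the function name `random_walk_vectorized` and the fixed seed are assumptions made for illustration and reproducibility.

```python
import numpy as np

def random_walk_vectorized(N, rng):
    """Return the N positions of a walk that moves -1 or +1 at each step."""
    steps = rng.choice([-1, 1], size=N)   # all N steps drawn in one call
    return np.cumsum(steps)               # running total gives the position

rng = np.random.default_rng(seed=0)       # seeded so repeated runs give the same walk
walk = random_walk_vectorized(100, rng)

print(walk[:10])
print('final position:', walk[-1])
```

Passing the generator in as an argument keeps the function free of hidden global state, which makes results easier to reproduce.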


Variable scope¶

variables created inside a function are "local" to it: they disappear when the function returns, and assigning to them does not change a variable of the same name outside

• the scope of a variable is the region of code where it is defined; for a local variable, that is the function body and anything nested inside it.
In [174]:
def lister(x):
    L = [x]*10
    v = 47
    print(L)

In [173]:
print(L)
lister(4)
print(L)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[4, 4, 4, 4, 4, 4, 4, 4, 4, 4]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]


There are subtleties to variable scope. If a name is neither an argument nor assigned inside the function, Python looks it up "outside" the function (in the enclosing or global scope).

In [186]:
L = ['a','b','c']
def j(x):
    L.append(x)

In [187]:
j('u')

In [188]:
L

Out[188]:
['a', 'b', 'c', 'u']

And mutable structures (like lists and arrays) CAN be modified in place inside a function EVEN IF they are passed in as an argument; the caller sees the change.

In [193]:
L = ['a','b','c']
def j(x,L):
    L.append(x)
j('x',L)
print(L)

['a', 'b', 'c', 'x']


The full set of rules is a bit complicated and we won't go into all the details.
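One distinction that covers most everyday cases: assigning to a name inside a function only rebinds the local name, while calling a method that changes the object in place affects the caller's object too. A minimal sketch, with function and variable names made up for illustration:

```python
def rebind(items):
    # builds a new list and points the LOCAL name at it;
    # the caller's list is untouched
    items = items + ['new']
    return items

def mutate(items):
    # changes the caller's list in place
    items.append('new')

letters = ['a', 'b', 'c']

result = rebind(letters)
after_rebind = list(letters)   # snapshot of the caller's list after rebind
mutate(letters)

print(result)         # the new list built inside rebind
print(after_rebind)   # the original list, unchanged by rebind
print(letters)        # the original list, changed by mutate
```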

Conditionals¶

In [194]:
def my_abs(x):
    if x<0:
        return -x
    else:
        return x

In [195]:
my_abs(-1)

Out[195]:
1

The next function returns a "tuple" containing two lists. You can get at the two lists by indexing (T[0] and T[1]) or by unpacking with the A, B = construction.

In [196]:
def split_threshold(threshold,L):
    Low = []
    High = []
    for item in L:
        if item<threshold:
            Low.append(item)
        else:
            High.append(item)
    return Low, High


In [203]:
T = split_threshold(0,[-1,-5,2,-13,11,100])
print(T)

([-1, -5, -13], [2, 11, 100])

In [204]:
print(T[0])

[-1, -5, -13]

In [205]:
print(T[1])

[2, 11, 100]


This is the syntax for unpacking a tuple

In [198]:
Low, High = split_threshold(0,[-1,-5,2,-13,11,100])

In [206]:
Low

Out[206]:
[-1, -5, -13]
In [207]:
High

Out[207]:
[2, 11, 100]

Python has and and or operators for combining conditions

In [209]:
data['pop'].head()

Out[209]:
country
Afghanistan     8425333.0
Afghanistan     9240934.0
Afghanistan    10267083.0
Afghanistan    11537966.0
Afghanistan    13079460.0
Name: pop, dtype: float64
In [217]:
for x in 'Jeremy Teitelbaum':
    if (x>='r' and x<='u'):
        print(x)

r
t
u


applying a function in pandas¶
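The cells below call population_size, but its definition does not appear in these notes. The following is a guess at what it might look like, using if/elif/else. The cutoffs of 10 million and 1 billion are assumptions chosen only to agree with the outputs shown in this section; the original thresholds may differ.

```python
def population_size(pop):
    """Classify a population count as 'small', 'medium', or 'large'.

    The cutoffs are assumptions, not the original definition.
    """
    if pop < 10_000_000:          # under 10 million (assumed cutoff)
        return 'small'
    elif pop < 1_000_000_000:     # under 1 billion (assumed cutoff)
        return 'medium'
    else:
        return 'large'

print(population_size(8425333))
print(population_size(300000000))
print(population_size(1.2804e9))
```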

In [223]:
population_size(300000000)

Out[223]:
'medium'
In [227]:
data['pop_class']=data['pop'].apply(population_size)

In [242]:
data.head()

Out[242]:
             year         pop  continent  lifeExp   gdpPercap  pop_class
country
Afghanistan  1952   8425333.0       Asia   28.801  779.445314      small
Afghanistan  1957   9240934.0       Asia   30.332  820.853030      small
Afghanistan  1962  10267083.0       Asia   31.997  853.100710     medium
Afghanistan  1967  11537966.0       Asia   34.020  836.197138     medium
Afghanistan  1972  13079460.0       Asia   36.088  739.981106     medium
In [241]:
data[(data['year']==2002)].groupby('pop_class')['lifeExp'].mean().plot(kind='bar')

Out[241]:
<matplotlib.axes._subplots.AxesSubplot at 0x1190232e8>
In [243]:
def life_exp_by_class_and_year(year):
    # filter on the year argument rather than a hard-coded 2002
    data[(data['year']==year)].groupby('pop_class')['lifeExp'].mean().plot(kind='bar')

In [245]:
life_exp_by_class_and_year(1957)

In [246]:
data[(data['year']==2002) & (data['pop_class'] == 'large')]

Out[246]:
         year           pop  continent  lifeExp    gdpPercap  pop_class
country
China    2002  1.280400e+09       Asia   72.028  3119.280896      large
India    2002  1.034173e+09       Asia   62.879  1746.769454      large

docstrings¶

In [254]:
def threshold(x,L):
    """Returns (Low, High) where Low is the list of elements in L less than x,
    and High is a list of those greater than or equal to x"""
    Low, High = [], []
    for item in L:
        if item<x:
            Low.append(item)
        else:
            High.append(item)
    return Low, High

In [255]:
?threshold

Signature: threshold(x, L)
Docstring:
Returns (Low, High) where Low is the list of elements in L less than x,
and High is a list of those greater than or equal to x
File:      ~/GitHub/Carpentries/2020-01-13/notes/<ipython-input-254-5a9844fcc62c>
Type:      function


default arguments¶

In [256]:
def threshold(L,x=0):
    """Returns (Low, High) where Low is the list of elements in L less than x,
    and High is a list of those greater than or equal to x.  x defaults to zero."""
    Low, High = [], []
    for item in L:
        if item<x:
            Low.append(item)
        else:
            High.append(item)
    return Low, High

In [257]:
threshold([1,-3,2,5])

Out[257]:
([-3], [1, 2, 5])
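A default is only used when the caller leaves the argument out. A short sketch of the three ways to call the function, with threshold redefined so the example runs on its own (the result names are illustrative):

```python
def threshold(L, x=0):
    """Returns (Low, High): elements of L less than x, and the rest."""
    Low, High = [], []
    for item in L:
        if item < x:
            Low.append(item)
        else:
            High.append(item)
    return Low, High

with_default = threshold([1, -3, 2, 5])         # x falls back to 0
by_position = threshold([1, -3, 2, 5], 2)       # x given as the second argument
by_keyword = threshold([1, -3, 2, 5], x=2)      # x given by name

print(with_default)
print(by_position)
print(by_keyword)
```

Passing the argument by name is often easier to read at the call site, especially when a function has several optional arguments.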
In [258]:
?threshold

Signature: threshold(L, x=0)
Docstring:
Returns (Low, High) where Low is the list of elements in L less than x,
and High is a list of those greater than or equal to x.  x defaults to zero.
File:      ~/GitHub/Carpentries/2020-01-13/notes/<ipython-input-256-bc8a5db254e7>
Type:      function

In [259]:
?print

Docstring:
print(value, ..., sep=' ', end='\n', file=sys.stdout, flush=False)

Prints the values to a stream, or to sys.stdout by default.
Optional keyword arguments:
file:  a file-like object (stream); defaults to the current sys.stdout.
sep:   string inserted between values, default a space.
end:   string appended after the last value, default a newline.
flush: whether to forcibly flush the stream.
Type:      builtin_function_or_method
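The keyword arguments listed in the docstring for print are default arguments like the ones above. A small sketch of overriding sep, end, and file; the io.StringIO buffer is used here only as a stand-in stream so the output can be captured in a variable:

```python
import io

# sep replaces the space normally placed between values
print(2007, 12, 10, sep='/')

# end replaces the newline normally added after the last value
print('no newline here', end='')
print(' ... so this continues on the same line')

# file sends the output to a stream other than sys.stdout
buffer = io.StringIO()
print(2007, 12, 10, sep='/', end='', file=buffer)
text = buffer.getvalue()
print(repr(text))
```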
