A few more notes on lists¶

methods vs functions¶

- methods and statements that modify a list in place (they do not hand back a new list)
  - append `L.append(x)`
  - del  `del L[3]` (a statement rather than a method)
  - sort  `L.sort()`
 
- functions and operators that create a new list, leaving the original one alone

   - `S = L + [x]`
   - `Y = sorted(L)`

primes = [2,3,5,7,11,13,17,19]
new = primes.append(23)
print('returned value = ',new,'original = ',primes)

returned value =  None original =  [2, 3, 5, 7, 11, 13, 17, 19, 23]

primes = [2,3,5,7,11,13,17,19]
new = primes + [23]
print('returned value = ',new,'original = ',primes)

returned value =  [2, 3, 5, 7, 11, 13, 17, 19, 23] original =  [2, 3, 5, 7, 11, 13, 17, 19]

del primes[3]

primes

[2, 3, 5, 11, 13, 17, 19]

primes = [2,3,5,7,11,13,17,19]
new = primes.sort(reverse=True)
print('returned value = ',new,'original = ',primes)

returned value =  None original =  [19, 17, 13, 11, 7, 5, 3, 2]

primes = [2,3,5,7,11,13,17,19]
new= sorted(primes,reverse=True)
print('returned value = ',new,'original = ',primes)

returned value =  [19, 17, 13, 11, 7, 5, 3, 2] original =  [2, 3, 5, 7, 11, 13, 17, 19]

copying¶

if L is a list, then setting S=L creates a new NAME (S) for the same list, but not a new object
Any in-place change made through one name (for example L.append(x)) is therefore visible through the other.

primes = [2,3,5,7,11,13,17,19]
S = primes
primes.append(23)
print(S)

[2, 3, 5, 7, 11, 13, 17, 19, 23]

taking a slice is one way to force a copy

primes = [2,3,5,7,11,13,17,19]
S = primes[:]
primes.append(23)
print(S)

[2, 3, 5, 7, 11, 13, 17, 19]

so is using the .copy() method

primes = [2,3,5,7,11,13,17,19]
S = primes.copy()
primes.append(23)
print(S,primes)

[2, 3, 5, 7, 11, 13, 17, 19] [2, 3, 5, 7, 11, 13, 17, 19, 23]
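One way to check whether two names refer to the same list is the `is` operator, which compares identity rather than contents. The sketch below uses a shortened list and the made-up names `alias` and `clone` purely for illustration.

```python
primes = [2, 3, 5, 7]
alias = primes          # a second name for the same list object
clone = primes.copy()   # a new list holding the same elements

print(alias is primes)  # True: same object
print(clone is primes)  # False: different objects
print(clone == primes)  # True: equal contents

primes.append(11)
print(alias)            # [2, 3, 5, 7, 11]  (follows the change)
print(clone)            # [2, 3, 5, 7]      (unaffected)
```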

lists and strings¶

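list(s) turns a string into a list of its characters
join and split convert between a list of strings and a single delimited string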

','.join(['a','b','c'])

'a,b,c'

list('ABCDEF')

['A', 'B', 'C', 'D', 'E', 'F']

commas = ','.join(list('ABCDEF'))
print(commas)

A,B,C,D,E,F

commas.split(',')

['A', 'B', 'C', 'D', 'E', 'F']

data = ".5,.2,1.8,17,21"

data.split(',')

['.5', '.2', '1.8', '17', '21']

print('\t'.join(data.split(',')))

.5	.2	1.8	17	21

Iteration¶

iteration means doing an operation repeatedly with varying parameters
the for statement is the fundamental way to iterate in Python
INDENTATION MATTERS IN PYTHON SYNTAX
- you can indent using tabs or spaces, but every line of a block must line up (Python 3 rejects inconsistent mixing of tabs and spaces)
- notice the colon at the end of the first line

for x in [2,3,5,7,11,13,17,19]:
    print(x,'is a prime')
    print(x**2,'is its square')

2 is a prime
4 is its square
3 is a prime
9 is its square
5 is a prime
25 is its square
7 is a prime
49 is its square
11 is a prime
121 is its square
13 is a prime
169 is its square
17 is a prime
289 is its square
19 is a prime
361 is its square

for x in [2,3,5,7,11,13,17,19]:
  print(x,'is a prime')
   print(x**2,'is its square')

  File "<ipython-input-64-36ca985a007c>", line 3
    print(x**2,'is its square')
    ^
IndentationError: unexpected indent

See Tabs vs Spaces

iterating over a range, or over a list¶

range(100)
range(10,20)
range(10,20,30)
np.arange(.1,1.1,.05)
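A range does not show its values until it is iterated over, so wrapping it in list() is a quick way to see what it produces. The sketch below covers the three call forms (stop; start, stop; start, stop, step). The step-3 case is an extra example that is not in the list above.

```python
r_stop = list(range(5))            # stop only: starts at 0, stops before 5
r_start = list(range(10, 20))      # start and stop
r_step = list(range(10, 20, 3))    # start, stop, step (added example)
r_big = list(range(10, 20, 30))    # a step larger than the interval gives one value

print(r_stop)   # [0, 1, 2, 3, 4]
print(r_start)  # [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
print(r_step)   # [10, 13, 16, 19]
print(r_big)    # [10]
```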

for x in range(10):
    print(x)

L=[]
for x in range(10):
    L.append(x)
print(L)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

for x in range(10,0,-1):
    print(x)

10
9
8
7
6
5
4
3
2
1

import numpy as np

for x in np.arange(1,5,.1):
    print(x,np.sin(x))

1.0 0.8414709848078965
1.1 0.8912073600614354
1.2000000000000002 0.9320390859672264
1.3000000000000003 0.9635581854171931
1.4000000000000004 0.9854497299884603
1.5000000000000004 0.9974949866040544
1.6000000000000005 0.9995736030415051
1.7000000000000006 0.9916648104524686
1.8000000000000007 0.973847630878195
1.9000000000000008 0.9463000876874142
2.000000000000001 0.9092974268256814
2.100000000000001 0.8632093666488733
2.200000000000001 0.8084964038195895
2.300000000000001 0.7457052121767194
2.4000000000000012 0.67546318055115
2.5000000000000013 0.5984721441039554
2.6000000000000014 0.515501371821463
2.7000000000000015 0.42737988023382856
2.8000000000000016 0.3349881501559034
2.9000000000000017 0.23924932921398068
3.0000000000000018 0.14112000805986546
3.100000000000002 0.041580662433288715
3.200000000000002 -0.058374143427581855
3.300000000000002 -0.1577456941432504
3.400000000000002 -0.2555411020268334
3.500000000000002 -0.35078322768962195
3.6000000000000023 -0.44252044329485446
3.7000000000000024 -0.5298361409084953
3.8000000000000025 -0.611857890942721
3.9000000000000026 -0.6877661591839757
4.000000000000003 -0.75680249530793
4.100000000000003 -0.8182771110644123
4.200000000000003 -0.8715757724135894
4.3000000000000025 -0.916165936749456
4.400000000000003 -0.9516020738895169
4.5000000000000036 -0.9775301176650978
4.600000000000003 -0.9936910036334649
4.700000000000003 -0.9999232575641009
4.800000000000003 -0.9961646088358403
4.900000000000004 -0.9824526126243318
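The long tails such as 1.2000000000000002 in the output above come from floating-point rounding: the step 0.1 has no exact binary representation, so small errors show up in the computed values. One common workaround, sketched here with plain Python and the math module instead of numpy, is to loop over integers and divide once per value. Another option is np.linspace, which takes the number of points instead of a step.

```python
import math

xs = []
for i in range(10, 50):      # integers 10, 11, ..., 49
    x = i / 10               # a single division per value, no accumulated error
    xs.append(x)

print(xs[:3])                # [1.0, 1.1, 1.2]
print(xs[-1])                # 4.9
print(xs[10], math.sin(xs[10]))
```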

accumulating¶

# named total so that the built-in function sum() is not overwritten
total = 0
for x in range(20):
    total = total + x
print(total)

190

sentence = 'Now is the time for all good men to come to the aid of their party'
words = sentence.split()
total = 0
for x in words:
    print(x,len(x))
    total = total + len(x)
print('Total length is',total)

Now 3
is 2
the 3
time 4
for 3
all 3
good 4
men 3
to 2
come 4
to 2
the 3
aid 3
of 2
their 5
party 5
Total length is 51
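Python also has a built-in function called sum that performs this kind of accumulation directly, which is one reason it is safer not to reuse the name sum for a variable. A minimal sketch, using a shortened phrase and the illustrative names `total` and `letters`:

```python
total = sum(range(20))           # same result as the accumulation loop over range(20)
print(total)                     # 190

words = 'to come to the aid'.split()
lengths = []
for w in words:
    lengths.append(len(w))
letters = sum(lengths)           # add up the word lengths
print(lengths, letters)          # [2, 4, 2, 3, 3] 14
```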

# iterating over files
import pandas as pd
for file in ['../data/gapminder_gdp_africa.csv','../data/gapminder_gdp_asia.csv']:
    data = pd.read_csv(file,index_col='country')
    print(file, data['gdpPercap_1957'].idxmin())

../data/gapminder_gdp_africa.csv Lesotho
../data/gapminder_gdp_asia.csv Myanmar

reading a directory¶

the glob function returns the names of all files that match a pattern written with UNIX-style wildcards (such as *)

from glob import glob

for file in glob('../data/*gdp*.csv'):
    data = pd.read_csv(file,index_col='country')
    print(file, data['gdpPercap_1957'].idxmin())

../data/gapminder_gdp_americas.csv Dominican Republic
../data/gapminder_gdp_europe.csv Bosnia and Herzegovina
../data/gapminder_gdp_oceania.csv Australia
../data/gapminder_gdp_africa.csv Lesotho
../data/gapminder_gdp_asia.csv Myanmar
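Notice that the files above do not come back in alphabetical order: glob makes no promise about ordering. If the order matters, wrapping the call in sorted() is a simple fix. The sketch below builds a few empty stand-in files in a temporary folder (the file names only imitate the gapminder ones) so that it can run anywhere.

```python
import os
import tempfile
from glob import glob

with tempfile.TemporaryDirectory() as folder:
    # create three small stand-in files (hypothetical contents)
    for region in ['oceania', 'africa', 'asia']:
        path = os.path.join(folder, 'gapminder_gdp_' + region + '.csv')
        with open(path, 'w') as handle:
            handle.write('country\n')

    pattern = os.path.join(folder, '*gdp*.csv')
    names = []
    for file in sorted(glob(pattern)):     # sorted() fixes the order
        names.append(os.path.basename(file))

print(names)
# ['gapminder_gdp_africa.csv', 'gapminder_gdp_asia.csv', 'gapminder_gdp_oceania.csv']
```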

for file in glob('../data/*gdp*.csv'):
    data = pd.read_csv(file,index_col='country')
    continent = file.split('_')[2].split('.')[0].upper()
    print(continent, '\nPoorest in 1957:',data['gdpPercap_1957'].idxmin(),'\nRichest in 1957',data['gdpPercap_1957'].idxmax(),'\n\n')

AMERICAS 
Poorest in 1957: Dominican Republic 
Richest in 1957 United States 


EUROPE 
Poorest in 1957: Bosnia and Herzegovina 
Richest in 1957 Switzerland 


OCEANIA 
Poorest in 1957: Australia 
Richest in 1957 New Zealand 


AFRICA 
Poorest in 1957: Lesotho 
Richest in 1957 South Africa 


ASIA 
Poorest in 1957: Myanmar 
Richest in 1957 Kuwait
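The expression file.split('_')[2] works here because the folder names contain no underscores. A slightly more defensive sketch strips the folder and the extension with os.path first. The function name `continent_from_path` and the second path are made up for illustration.

```python
import os

def continent_from_path(path):
    """Return the upper-case continent from a name like gapminder_gdp_<continent>.csv."""
    filename = os.path.basename(path)          # drop the folders
    stem = os.path.splitext(filename)[0]       # drop the .csv extension
    return stem.split('_')[-1].upper()         # last piece after the underscores

print(continent_from_path('../data/gapminder_gdp_asia.csv'))     # ASIA
print(continent_from_path('../my_data/gapminder_gdp_asia.csv'))  # ASIA

# the original expression picks the wrong piece when a folder name has an underscore
print('../my_data/gapminder_gdp_asia.csv'.split('_')[2])         # gdp
```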

nested loops¶

for fruit in ['apple','pear','grape']:
    for color in ['green','orange','yellow']:
        print('I wish I had a ',color,fruit)

I wish I had a  green apple
I wish I had a  orange apple
I wish I had a  yellow apple
I wish I had a  green pear
I wish I had a  orange pear
I wish I had a  yellow pear
I wish I had a  green grape
I wish I had a  orange grape
I wish I had a  yellow grape

putting some stuff together:¶

- string operations to extract the continent from the file name
- making a column name

for file in glob('../data/*gdp*.csv'):
    # read each file once, before looping over the years
    data = pd.read_csv(file,index_col='country')
    continent = file.split('_')[2].split('.')[0].upper()
    for year in ['1957','2002']:
        key = 'gdpPercap_'+year
        print(continent, '\nPoorest in '+year+':',data[key].idxmin(),'\nRichest in '+year+':',data[key].idxmax(),'\n\n')

AMERICAS 
Poorest in 1957: Dominican Republic 
Richest in 1957: United States 


AMERICAS 
Poorest in 2002: Haiti 
Richest in 2002: United States 


EUROPE 
Poorest in 1957: Bosnia and Herzegovina 
Richest in 1957: Switzerland 


EUROPE 
Poorest in 2002: Albania 
Richest in 2002: Norway 


OCEANIA 
Poorest in 1957: Australia 
Richest in 1957: New Zealand 


OCEANIA 
Poorest in 2002: New Zealand 
Richest in 2002: Australia 


AFRICA 
Poorest in 1957: Lesotho 
Richest in 1957: South Africa 


AFRICA 
Poorest in 2002: Congo Dem. Rep. 
Richest in 2002: Gabon 


ASIA 
Poorest in 1957: Myanmar 
Richest in 1957: Kuwait 


ASIA 
Poorest in 2002: Myanmar 
Richest in 2002: Singapore

from glob import glob
import pandas as pd
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1,1)
for filename in glob('../data/gapminder_gdp*.csv'):
    # use the country names as the index so that every remaining column is numeric
    dataframe = pd.read_csv(filename,index_col='country')
    # extract <region> from the filename, expected to be in the format '../data/gapminder_gdp_<region>.csv'.
    # we will split the string using the split method and `_` as our separator,
    # retrieve the last string in the list that split returns (`<region>.csv`), 
    # and then remove the `.csv` extension from that string.
    region = filename.split('_')[-1][:-4] 
    # average each year's column over the countries in the region and plot it
    dataframe.mean().plot(ax=ax, label=region)
plt.legend()
plt.show()

Functions¶

We have seen many examples of existing functions (print, the math functions, read_csv, and so on). You can also define your own: a function takes certain inputs (its arguments) and can return a value.

def print_date(year, month, day):
    joined = str(year)+'/'+str(month)+'/'+str(day)
    print(joined)

print_date(2007,11,31)

2007/11/31

You can return the value for use in other computations¶

def format_date(year,month,day):
    joined = str(year)+'/'+str(month)+'/'+str(day)
    return joined

format_date(2007,12,10)

'2007/12/10'

dates = []
for year in range(2007,2010):
    for month in range(5, 8):
        for day in range(12,20):
            dates.append(format_date(year,month,day))

dates

['2007/5/12',
 '2007/5/13',
 '2007/5/14',
 '2007/5/15',
 '2007/5/16',
 '2007/5/17',
 '2007/5/18',
 '2007/5/19',
 '2007/6/12',
 '2007/6/13',
 '2007/6/14',
 '2007/6/15',
 '2007/6/16',
 '2007/6/17',
 '2007/6/18',
 '2007/6/19',
 '2007/7/12',
 '2007/7/13',
 '2007/7/14',
 '2007/7/15',
 '2007/7/16',
 '2007/7/17',
 '2007/7/18',
 '2007/7/19',
 '2008/5/12',
 '2008/5/13',
 '2008/5/14',
 '2008/5/15',
 '2008/5/16',
 '2008/5/17',
 '2008/5/18',
 '2008/5/19',
 '2008/6/12',
 '2008/6/13',
 '2008/6/14',
 '2008/6/15',
 '2008/6/16',
 '2008/6/17',
 '2008/6/18',
 '2008/6/19',
 '2008/7/12',
 '2008/7/13',
 '2008/7/14',
 '2008/7/15',
 '2008/7/16',
 '2008/7/17',
 '2008/7/18',
 '2008/7/19',
 '2009/5/12',
 '2009/5/13',
 '2009/5/14',
 '2009/5/15',
 '2009/5/16',
 '2009/5/17',
 '2009/5/18',
 '2009/5/19',
 '2009/6/12',
 '2009/6/13',
 '2009/6/14',
 '2009/6/15',
 '2009/6/16',
 '2009/6/17',
 '2009/6/18',
 '2009/6/19',
 '2009/7/12',
 '2009/7/13',
 '2009/7/14',
 '2009/7/15',
 '2009/7/16',
 '2009/7/17',
 '2009/7/18',
 '2009/7/19']

functions and data analysis¶

Recall that we loaded the file gapminder_data.csv into a pandas dataframe.

import matplotlib.pyplot as plt
plt.style.use('ggplot')

data = pd.read_csv('../data/gapminder_data.csv',index_col='country')

percapita = pd.pivot_table(data,index='country',columns='year',values='gdpPercap')

percapita.loc['Afghanistan'].plot(title='Afghanistan GDP Per capita over Time')

<matplotlib.axes._subplots.AxesSubplot at 0x1174d7940>

def country_gdp(country):
    # legend=True labels each line with the country name, so repeated calls
    # can share one set of axes under a generic title
    percapita.loc[country].plot(figsize=(8,8),legend=True,title='GDP Per Capita')

country_gdp("United States")
country_gdp("Germany")
country_gdp("Switzerland")

for country in ['United States','China','Germany','Australia','Brazil']:
    country_gdp(country)

A random walk¶

import numpy.random as rnd

x = rnd.choice([-1,1])

def random_walk(N):
    spot = 0
    L = []
    for i in range(N):
        spot = spot + rnd.choice([-1,1])
        L.append(spot)
    return L

end_spots = []
for i in range(200):
    walk = random_walk(100)
    plt.plot(range(100),walk)
    end_spots.append(walk[-1])

s=plt.hist(end_spots,bins=10)
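Each run of the cells above gives different walks. For a repeatable experiment you can fix the seed of the random number generator. The sketch below swaps in Python's standard random module instead of numpy.random, and the seed value 0 is arbitrary. It also checks a simple property: after an even number of +1/-1 steps, every end spot is even.

```python
import random

def random_walk(n_steps, rng):
    """Return the list of positions visited by a +1/-1 walk that starts at 0."""
    spot = 0
    path = []
    for _ in range(n_steps):
        spot = spot + rng.choice([-1, 1])
        path.append(spot)
    return path

rng = random.Random(0)                    # seeded generator: same walks on every run
end_spots = []
for _ in range(200):
    end_spots.append(random_walk(100, rng)[-1])

all_even = all(e % 2 == 0 for e in end_spots)
print(all_even)                           # True
print(min(end_spots), max(end_spots))     # spread of the 200 end spots
```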

Variable scope¶

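variables created inside a function are "local" to the function: they exist only while it runs, and assigning to them does not change a variable of the same name outside

the scope of a variable is the region of code where its name can be used. For a local variable, that is the body of the function that defines it and anything nested inside that.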

def lister(x):
    L = [x]*10
    v = 47
    print(L)

print(L)
lister(4)
print(L)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[4, 4, 4, 4, 4, 4, 4, 4, 4, 4]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
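The local variable v never becomes visible outside lister either. A small sketch of what happens if you try to use it, with lister changed to return its list instead of printing it (the names `outer` and `leaked` are illustrative only):

```python
def lister(x):
    L = [x]*10     # local L, unrelated to any L defined outside
    v = 47         # local v
    return L

outer = lister(4)
try:
    print(v)       # v only existed while lister was running
    leaked = True
except NameError:
    leaked = False
    print('v is not defined outside lister')
```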

There are subtleties to variable scope. If a name is neither an argument nor assigned inside the function, then Python looks for it "outside" the function (here, among the global variables).

L = ['a','b','c']
def j(x):
    L.append(x)

j('u')

L

['a', 'b', 'c', 'u']

And mutable structures (like lists and arrays) CAN be modified inside a function EVEN IF they are given as an argument, because the argument is just another name for the same object.

L = ['a','b','c']
def j(x,L):
    L.append(x)
j('x',L)
print(L)

['a', 'b', 'c', 'x']
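The difference between changing an object and re-assigning a name is what matters here. In this sketch (the function names `append_item` and `rebind` are made up for illustration), the first function modifies the caller's list while the second only points its local name at a new list.

```python
def append_item(x, items):
    items.append(x)        # modifies the list the caller passed in

def rebind(x, items):
    items = items + [x]    # builds a new list and binds it to the LOCAL name only

letters = ['a', 'b', 'c']
append_item('x', letters)
rebind('y', letters)
print(letters)             # ['a', 'b', 'c', 'x']
```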

The full set of rules is a bit complicated and we won't go into all the details.

Conditionals¶

def my_abs(x):
    if x<0:
        return -x
    else:
        return x

my_abs(-1)

1

The next function returns a "tuple" that holds a pair of lists. You can get the two lists with subscripts or with the A, B = construction.

def split_threshold(threshold,L):
    Low = []
    High = []
    for item in L:
        if item<threshold:
            Low.append(item)
        else:
            High.append(item)
    return Low, High

T = split_threshold(0,[-1,-5,2,-13,11,100])
print(T)

([-1, -5, -13], [2, 11, 100])

print(T[0])

[-1, -5, -13]

print(T[1])

[2, 11, 100]

This is the syntax for unpacking a tuple

Low, High = split_threshold(0,[-1,-5,2,-13,11,100])

Low

[-1, -5, -13]

High

[2, 11, 100]

Python has the logical operators and and or for combining conditions

data['pop'].head()

country
Afghanistan     8425333.0
Afghanistan     9240934.0
Afghanistan    10267083.0
Afghanistan    11537966.0
Afghanistan    13079460.0
Name: pop, dtype: float64

for x in 'Jeremy Teitelbaum':
    if (x>='r' and x<='u'):
        print(x)

r
t
u
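Two related forms, as a small sketch: Python lets you chain comparisons, so the range test can be written without and, while or accepts a character if either condition holds. The cut-off letters 'b' and 'x' in the second loop are arbitrary choices for illustration.

```python
name = 'Jeremy Teitelbaum'

in_range = []
for x in name:
    if 'r' <= x <= 'u':        # same test as (x>='r' and x<='u')
        in_range.append(x)

outside = []
for x in name:
    if x < 'b' or x > 'x':     # true when either condition holds
        outside.append(x)

print(in_range)   # ['r', 't', 'u']
print(outside)    # ['J', 'y', ' ', 'T', 'a']
```

Capital letters and the space sort before the lower-case letters, which is why 'J', 'T' and ' ' pass the x < 'b' test.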

applying a function in pandas¶
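The cells in this section call population_size, but its definition does not appear in these notes. The sketch below is one possible version that agrees with the outputs shown. The cut-offs of 10 million and 1 billion are assumptions, not the original values.

```python
def population_size(pop):
    """Classify a population count as 'small', 'medium' or 'large'.

    The thresholds are assumed for illustration; the original ones are not given.
    """
    if pop < 10000000:          # assumed cut-off: 10 million
        return 'small'
    elif pop < 1000000000:      # assumed cut-off: 1 billion
        return 'medium'
    else:
        return 'large'

print(population_size(300000000))   # medium
print(population_size(8425333.0))   # small
```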

population_size(300000000)

'medium'

data['pop_class']=data['pop'].apply(population_size)

data.head()

data[(data['year']==2002)].groupby('pop_class')['lifeExp'].mean().plot(kind='bar')

<matplotlib.axes._subplots.AxesSubplot at 0x1190232e8>

def life_exp_by_class_and_year(year):
    # select the rows for the requested year (not a fixed 2002)
    data[(data['year']==year)].groupby('pop_class')['lifeExp'].mean().plot(kind='bar')

life_exp_by_class_and_year(1957)

data[(data['year']==2002) & (data['pop_class'] == 'large')]

docstrings¶

def threshold(x,L):
    """Returns (Low, High) where Low is the list of elements in L less than x, 
    and High is a list of those greater than or equal to x"""
    Low, High = [], []
    for item in L:
        if item<x:
            Low.append(item)
        else:
            High.append(item)
    return Low, High

?threshold

Signature: threshold(x, L)
Docstring:
Returns (Low, High) where Low is the list of elements in L less than x, 
and High is a list of those greater than or equal to x
File:      ~/GitHub/Carpentries/2020-01-13/notes/<ipython-input-254-5a9844fcc62c>
Type:      function
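The ?threshold syntax only works in IPython and Jupyter. In plain Python the same text is available through the built-in help function or the function's __doc__ attribute, as in this short sketch that repeats the definition above:

```python
def threshold(x, L):
    """Returns (Low, High) where Low is the list of elements in L less than x,
    and High is a list of those greater than or equal to x"""
    Low, High = [], []
    for item in L:
        if item < x:
            Low.append(item)
        else:
            High.append(item)
    return Low, High

print(threshold.__doc__)   # the raw docstring
help(threshold)            # signature plus docstring
```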

default arguments¶

def threshold(L,x=0):
    """Returns (Low, High) where Low is the list of elements in L less than x, 
    and High is a list of those greater than or equal to x.  x defaults to zero."""
    Low, High = [], []
    for item in L:
        if item<x:
            Low.append(item)
        else:
            High.append(item)
    return Low, High

threshold([1,-3,2,5])

([-3], [1, 2, 5])

?threshold

Signature: threshold(L, x=0)
Docstring:
Returns (Low, High) where Low is the list of elements in L less than x, 
and High is a list of those greater than or equal to x.  x defaults to zero.
File:      ~/GitHub/Carpentries/2020-01-13/notes/<ipython-input-256-bc8a5db254e7>
Type:      function
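A default value is only used when the caller leaves the argument out. It can be overridden either by position or by name, as this sketch with the same function shows (the names `default_split` and `custom_split` are illustrative):

```python
def threshold(L, x=0):
    """Returns (Low, High), splitting L at x.  x defaults to zero."""
    Low, High = [], []
    for item in L:
        if item < x:
            Low.append(item)
        else:
            High.append(item)
    return Low, High

default_split = threshold([1, -3, 2, 5])        # x takes its default, 0
custom_split = threshold([1, -3, 2, 5], x=2)    # override the default by name
print(default_split)    # ([-3], [1, 2, 5])
print(custom_split)     # ([1, -3], [2, 5])
```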

?print

Docstring:
print(value, ..., sep=' ', end='\n', file=sys.stdout, flush=False)

Prints the values to a stream, or to sys.stdout by default.
Optional keyword arguments:
file:  a file-like object (stream); defaults to the current sys.stdout.
sep:   string inserted between values, default a space.
end:   string appended after the last value, default a newline.
flush: whether to forcibly flush the stream.
Type:      builtin_function_or_method
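The sep, end and file arguments listed in this docstring are default arguments too. A brief sketch of overriding them, writing into an in-memory text buffer so the result can be inspected:

```python
import io

print('a', 'b', 'c')                      # defaults: a b c
print('a', 'b', 'c', sep='-', end='!\n')  # a-b-c!

buffer = io.StringIO()                    # a file-like object held in memory
print('a', 'b', 'c', sep=',', file=buffer)
captured = buffer.getvalue()
print(repr(captured))                     # 'a,b,c\n'
```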

output of data.head():

             year         pop continent  lifeExp   gdpPercap pop_class
country
Afghanistan  1952   8425333.0      Asia   28.801  779.445314     small
Afghanistan  1957   9240934.0      Asia   30.332  820.853030     small
Afghanistan  1962  10267083.0      Asia   31.997  853.100710    medium
Afghanistan  1967  11537966.0      Asia   34.020  836.197138    medium
Afghanistan  1972  13079460.0      Asia   36.088  739.981106    medium

rows for 2002 with pop_class 'large':

         year           pop continent  lifeExp    gdpPercap pop_class
country
China    2002  1.280400e+09      Asia   72.028  3119.280896     large
India    2002  1.034173e+09      Asia   62.879  1746.769454     large