import pandas as pd
import numpy as np
Grouping and Summarizing in Pandas
Two python techniques
- dictionaries
- lambda functions
Refactoring
See this python script file for an example of how to pull out code into a python function that standardizes your data preparation. To use this, you can use the %load
command in a jupyter notebook. The command %load filename
loads the content of the file into an executable cell. You need to execute that cell.
If you modify the script, you need to reload it.
Grouping and Summarizing
Our first pass will be with our old friends the penguins.
= pd.read_csv("data/penguins-raw.csv") penguins
Some basic data cleaning
- Data types
penguins.dtypes
studyName object
Sample Number int64
Species object
Region object
Island object
Stage object
Individual ID object
Clutch Completion object
Date Egg object
Culmen Length (mm) float64
Culmen Depth (mm) float64
Flipper Length (mm) float64
Body Mass (g) float64
Sex object
Delta 15 N (o/oo) float64
Delta 13 C (o/oo) float64
Comments object
dtype: object
- Focus on Species, Island, Sex, Culmen Length/Depth, Flipper Length, Body Mass.
= ['Species','Island','Sex','Culmen Length (mm)','Culmen Depth (mm)','Flipper Length (mm)', 'Body Mass (g)']
focus = penguins[focus] simplified
- Clean up column names
= ['species','island','sex','culmen_length', 'culmen_depth','flipper_length','body_mass']
edited_columns = edited_columns simplified.columns
- Simplify factor names. (Note use of dictionary)
= simplified['species'].unique()
species ={x:x.split(' ')[0].lower() for x in species}
simple_species_dict'species'].map(simple_species_dict) simplified[
0 adelie
1 adelie
2 adelie
3 adelie
4 adelie
...
339 chinstrap
340 chinstrap
341 chinstrap
342 chinstrap
343 chinstrap
Name: species, Length: 344, dtype: object
- Remaking a column (watch out!)
#simplified['species'] = simplified['species'].map(simple_species_dict)
Old option: use .loc.
#simplified.loc[:,'species'] = simplified['species'].map(simple_species_dict)
Newer option: use .assign(). Notice that .assign() returns a dataframe.
= simplified.assign(species = lambda x: x['species'].map(simple_species_dict)) simplified
Fix some other factor variables:
= simplified.assign(island = lambda x: x.island.str.lower())
simplified = simplified.assign(sex = lambda x: x['sex'].str.lower()) simplified
- Standardize the variables - column by column
#simplified = simplified.assign(culmen_length_std = lambda x: (x.culmen_length-x.culmen_length.mean())/x.culmen_length.std())
or make a standardization function. (note use of **)
def standardize(x):
return (x-x.mean())/x.std()
= simplified.assign(
simplified **{i+'_std':(lambda x: standardize(x[i])) for i in simplified.columns[3:]}
)
- Missing Values
sum()
simplified.isna().= simplified.dropna(axis=0) simplified
- Grouping
Grouping combines with aggregation.
= ['culmen_length','culmen_depth','flipper_length','body_mass']
numerical_variables = simplified[['sex']+numerical_variables].groupby('sex').mean() by_sex_mean
Alternatively one can use .agg
= ['culmen_length','culmen_depth','flipper_length','body_mass']
numerical_variables = simplified[['sex']+numerical_variables].groupby('sex').agg('mean') by_sex
And then get multiple aggregations.
= simplified[['sex']+numerical_variables].groupby('sex').agg(['count','mean','std']) by_sex
To access individual elements, use tuples as names.
'culmen_depth','mean')] by_sex.loc[:,(
sex
female 16.425455
male 17.891071
Name: (culmen_depth, mean), dtype: float64
One can also group on multiple factors.
= (
by_sex_and_species "sex", "species"] + numerical_variables]
simplified[["sex", "species"])
.groupby([
.mean()
)= by_sex_and_species.loc[("female",)]
females_by_species = by_sex_and_species.loc[("male",)] males_by_species
You can skip levels in the hierarchy using slice(None)
:
slice('female'),slice('adelie','chinstrap')),:] by_sex_and_species.loc[(
culmen_length | culmen_depth | flipper_length | body_mass | ||
---|---|---|---|---|---|
sex | species | ||||
female | adelie | 37.257534 | 17.621918 | 187.794521 | 3368.835616 |
chinstrap | 46.573529 | 17.588235 | 191.735294 | 3527.205882 |
- Pivot tables
= by_sex_and_species.reset_index()
expanded ='sex',columns='species',values='culmen_length') expanded.pivot(index
species | adelie | chinstrap | gentoo |
---|---|---|---|
sex | |||
female | 37.257534 | 46.573529 | 45.563793 |
male | 40.390411 | 51.094118 | 49.473770 |
='culmen_length',index='sex',columns='species',aggfunc='mean') pd.pivot_table(simplified,values
species | adelie | chinstrap | gentoo |
---|---|---|---|
sex | |||
female | 37.257534 | 46.573529 | 45.563793 |
male | 40.390411 | 51.094118 | 49.473770 |
- Making a function
def ptable(value, aggfunc="mean"):
return pd.pivot_table(
=value, index="sex", columns="species", aggfunc=aggfunc
simplified, values
)
"body_mass", "std") ptable(
species | adelie | chinstrap | gentoo |
---|---|---|---|
sex | |||
female | 269.380102 | 285.333912 | 281.578294 |
male | 346.811553 | 362.137550 | 313.158596 |