Grouping and Summarizing in Pandas

Two python techniques

dictionaries
lambda functions

Refactoring

See this python script file for an example of how to pull out code into a python function that standardizes your data preparation. To use this, you can use the %load command in a jupyter notebook. The command %load filename loads the content of the file into an executable cell. You need to execute that cell.

If you modify the script, you need to reload it.

Grouping and Summarizing

import pandas as pd
import  numpy as np

Our first pass will be with our old friends the penguins.

penguins = pd.read_csv("data/penguins-raw.csv")

Some basic data cleaning

Data types

penguins.dtypes

studyName               object
Sample Number            int64
Species                 object
Region                  object
Island                  object
Stage                   object
Individual ID           object
Clutch Completion       object
Date Egg                object
Culmen Length (mm)     float64
Culmen Depth (mm)      float64
Flipper Length (mm)    float64
Body Mass (g)          float64
Sex                     object
Delta 15 N (o/oo)      float64
Delta 13 C (o/oo)      float64
Comments                object
dtype: object

Focus on Species, Island, Sex, Culmen Length/Depth, Flipper Length, Body Mass.

focus = ['Species','Island','Sex','Culmen Length (mm)','Culmen Depth (mm)','Flipper Length (mm)', 'Body Mass (g)']
simplified = penguins[focus]

Clean up column names

edited_columns = ['species','island','sex','culmen_length', 'culmen_depth','flipper_length','body_mass']
simplified.columns = edited_columns

Simplify factor names. (Note use of dictionary)

species = simplified['species'].unique()
simple_species_dict={x:x.split(' ')[0].lower() for x in species}
simplified['species'].map(simple_species_dict)

0         adelie
1         adelie
2         adelie
3         adelie
4         adelie
         ...    
339    chinstrap
340    chinstrap
341    chinstrap
342    chinstrap
343    chinstrap
Name: species, Length: 344, dtype: object

Remaking a column (watch out!)

#simplified['species'] = simplified['species'].map(simple_species_dict)

Old option: use .loc.

#simplified.loc[:,'species'] = simplified['species'].map(simple_species_dict)

Newer option: use .assign(). Notice that .assign() returns a dataframe.

simplified = simplified.assign(species = lambda x: x['species'].map(simple_species_dict))

Fix some other factor variables:

simplified  = simplified.assign(island = lambda x: x.island.str.lower())
simplified = simplified.assign(sex = lambda x: x['sex'].str.lower())

Standardize the variables - column by column

#simplified = simplified.assign(culmen_length_std = lambda x: (x.culmen_length-x.culmen_length.mean())/x.culmen_length.std())

or make a standardization function. (note use of **)

def standardize(x):
    return (x-x.mean())/x.std()
simplified = simplified.assign(
    **{i+'_std':(lambda x: standardize(x[i])) for i in simplified.columns[3:]}
)

Missing Values

simplified.isna().sum()
simplified = simplified.dropna(axis=0)

Grouping

Grouping combines with aggregation.

numerical_variables = ['culmen_length','culmen_depth','flipper_length','body_mass']
by_sex_mean = simplified[['sex']+numerical_variables].groupby('sex').mean()

Alternatively one can use .agg

numerical_variables = ['culmen_length','culmen_depth','flipper_length','body_mass']
by_sex = simplified[['sex']+numerical_variables].groupby('sex').agg('mean')

And then get multiple aggregations.

by_sex = simplified[['sex']+numerical_variables].groupby('sex').agg(['count','mean','std'])

To access individual elements, use tuples as names.

by_sex.loc[:,('culmen_depth','mean')]

sex
female    16.425455
male      17.891071
Name: (culmen_depth, mean), dtype: float64

One can also group on multiple factors.

by_sex_and_species = (
    simplified[["sex", "species"] + numerical_variables]
    .groupby(["sex", "species"])
    .mean()
)
females_by_species = by_sex_and_species.loc[("female",)]
males_by_species = by_sex_and_species.loc[("male",)]

You can skip levels in the hierarchy using slice(None):

by_sex_and_species.loc[(slice('female'),slice('adelie','chinstrap')),:]

		culmen_length	culmen_depth	flipper_length	body_mass
sex	species
female	adelie	37.257534	17.621918	187.794521	3368.835616
female	chinstrap	46.573529	17.588235	191.735294	3527.205882

Pivot tables

expanded = by_sex_and_species.reset_index()
expanded.pivot(index='sex',columns='species',values='culmen_length')

species	adelie	chinstrap	gentoo
sex
female	37.257534	46.573529	45.563793
male	40.390411	51.094118	49.473770

pd.pivot_table(simplified,values='culmen_length',index='sex',columns='species',aggfunc='mean')

species	adelie	chinstrap	gentoo
sex
female	37.257534	46.573529	45.563793
male	40.390411	51.094118	49.473770

Making a function

def ptable(value, aggfunc="mean"):
    return pd.pivot_table(
        simplified, values=value, index="sex", columns="species", aggfunc=aggfunc
    )


ptable("body_mass", "std")

species	adelie	chinstrap	gentoo
sex
female	269.380102	285.333912	281.578294
male	346.811553	362.137550	313.158596