Grouping and Summarizing in Pandas

Two python techniques

  1. dictionaries
  2. lambda functions

Refactoring

See this python script file for an example of how to pull out code into a python function that standardizes your data preparation. To use this, you can use the %load command in a jupyter notebook. The command %load filename loads the content of the file into an executable cell. You need to execute that cell.

If you modify the script, you need to reload it.

Grouping and Summarizing

import pandas as pd
import  numpy as np

Our first pass will be with our old friends the penguins.

penguins = pd.read_csv("data/penguins-raw.csv")

Some basic data cleaning

  1. Data types
penguins.dtypes
studyName               object
Sample Number            int64
Species                 object
Region                  object
Island                  object
Stage                   object
Individual ID           object
Clutch Completion       object
Date Egg                object
Culmen Length (mm)     float64
Culmen Depth (mm)      float64
Flipper Length (mm)    float64
Body Mass (g)          float64
Sex                     object
Delta 15 N (o/oo)      float64
Delta 13 C (o/oo)      float64
Comments                object
dtype: object
  1. Focus on Species, Island, Sex, Culmen Length/Depth, Flipper Length, Body Mass.
focus = ['Species','Island','Sex','Culmen Length (mm)','Culmen Depth (mm)','Flipper Length (mm)', 'Body Mass (g)']
simplified = penguins[focus]
  1. Clean up column names
edited_columns = ['species','island','sex','culmen_length', 'culmen_depth','flipper_length','body_mass']
simplified.columns = edited_columns
  1. Simplify factor names. (Note use of dictionary)
species = simplified['species'].unique()
simple_species_dict={x:x.split(' ')[0].lower() for x in species}
simplified['species'].map(simple_species_dict)
0         adelie
1         adelie
2         adelie
3         adelie
4         adelie
         ...    
339    chinstrap
340    chinstrap
341    chinstrap
342    chinstrap
343    chinstrap
Name: species, Length: 344, dtype: object
  1. Remaking a column (watch out!)
#simplified['species'] = simplified['species'].map(simple_species_dict)

Old option: use .loc.

#simplified.loc[:,'species'] = simplified['species'].map(simple_species_dict)

Newer option: use .assign(). Notice that .assign() returns a dataframe.

simplified = simplified.assign(species = lambda x: x['species'].map(simple_species_dict))

Fix some other factor variables:

simplified  = simplified.assign(island = lambda x: x.island.str.lower())
simplified = simplified.assign(sex = lambda x: x['sex'].str.lower())
  1. Standardize the variables - column by column
#simplified = simplified.assign(culmen_length_std = lambda x: (x.culmen_length-x.culmen_length.mean())/x.culmen_length.std())

or make a standardization function. (note use of **)

def standardize(x):
    return (x-x.mean())/x.std()
simplified = simplified.assign(
    **{i+'_std':(lambda x: standardize(x[i])) for i in simplified.columns[3:]}
)
  1. Missing Values
simplified.isna().sum()
simplified = simplified.dropna(axis=0)
  1. Grouping

Grouping combines with aggregation.

numerical_variables = ['culmen_length','culmen_depth','flipper_length','body_mass']
by_sex_mean = simplified[['sex']+numerical_variables].groupby('sex').mean()

Alternatively one can use .agg

numerical_variables = ['culmen_length','culmen_depth','flipper_length','body_mass']
by_sex = simplified[['sex']+numerical_variables].groupby('sex').agg('mean')

And then get multiple aggregations.

by_sex = simplified[['sex']+numerical_variables].groupby('sex').agg(['count','mean','std'])

To access individual elements, use tuples as names.

by_sex.loc[:,('culmen_depth','mean')]
sex
female    16.425455
male      17.891071
Name: (culmen_depth, mean), dtype: float64

One can also group on multiple factors.

by_sex_and_species = (
    simplified[["sex", "species"] + numerical_variables]
    .groupby(["sex", "species"])
    .mean()
)
females_by_species = by_sex_and_species.loc[("female",)]
males_by_species = by_sex_and_species.loc[("male",)]

You can skip levels in the hierarchy using slice(None):

by_sex_and_species.loc[(slice('female'),slice('adelie','chinstrap')),:]
culmen_length culmen_depth flipper_length body_mass
sex species
female adelie 37.257534 17.621918 187.794521 3368.835616
chinstrap 46.573529 17.588235 191.735294 3527.205882
  1. Pivot tables
expanded = by_sex_and_species.reset_index()
expanded.pivot(index='sex',columns='species',values='culmen_length')
species adelie chinstrap gentoo
sex
female 37.257534 46.573529 45.563793
male 40.390411 51.094118 49.473770
pd.pivot_table(simplified,values='culmen_length',index='sex',columns='species',aggfunc='mean')
species adelie chinstrap gentoo
sex
female 37.257534 46.573529 45.563793
male 40.390411 51.094118 49.473770
  1. Making a function
def ptable(value, aggfunc="mean"):
    return pd.pivot_table(
        simplified, values=value, index="sex", columns="species", aggfunc=aggfunc
    )


ptable("body_mass", "std")
species adelie chinstrap gentoo
sex
female 269.380102 285.333912 281.578294
male 346.811553 362.137550 313.158596