Working with data in python/pandas

Fundamentals of Data Science

Author

Jeremy Teitelbaum

Loading the key libraries

import sys
import pandas as pd
import numpy as np
import re

print(f"pandas version {pd.__version__}")
print(f"numpy version {np.__version__}")
print("\n".join(f"Python {sys.version}".split("|")))

pandas version 2.1.1
numpy version 1.24.3
Python 3.11.5 (main, Sep 11 2023, 13:54:46) [GCC 11.2.0]

A comment on file formats

The most common simple format for tabular data is comma separated or tab separated (csv or tsv).

Newer formats such as arrow and parquet are more efficient in storage and faster to load.

Pandas 2.0 can handle these newer formats.

Reading a dataframe

# read from a csv file
penguins = pd.read_csv("data/penguins-raw.csv")
# read from a url
# url = "https://raw.githubusercontent.com/mcnakhaee/palmerpenguins/master/palmerpenguins/data/penguins-raw.csv"
# penguins = pd.read_csv(url)
# read from an excel file
# penguins = pd.read_excel('file.xlsx')
rows, cols = penguins.shape
print(f"Rows: {rows}, Columns: {cols}")
print(f"Columns:", "\n".join(penguins.columns))

Rows: 344, Columns: 17
Columns: studyName
Sample Number
Species
Region
Island
Stage
Individual ID
Clutch Completion
Date Egg
Culmen Length (mm)
Culmen Depth (mm)
Flipper Length (mm)
Body Mass (g)
Sex
Delta 15 N (o/oo)
Delta 13 C (o/oo)
Comments

Series

Each column of a dataframe is a series accessed by name.

penguins["Culmen Length (mm)"]

0      39.1
1      39.5
2      40.3
3       NaN
4      36.7
       ... 
339    55.8
340    43.5
341    49.6
342    50.8
343    50.2
Name: Culmen Length (mm), Length: 344, dtype: float64

Note the last row: - Name - Length - dtype

Types are “inferred” by the read_csv function.

Another example

penguins['Date Egg']

0      2007-11-11
1      2007-11-11
2      2007-11-16
3      2007-11-16
4      2007-11-16
          ...    
339    2009-11-19
340    2009-11-21
341    2009-11-21
342    2009-11-21
343    2009-11-21
Name: Date Egg, Length: 344, dtype: object

Here the type is “object” which is the generic python object. But these are clearly supposed to be dates. We’ll fix that later.

Alternative syntax

# if the column name is simple, you can use a simpler syntax.
penguins.Sex

0        MALE
1      FEMALE
2      FEMALE
3         NaN
4      FEMALE
        ...  
339      MALE
340    FEMALE
341      MALE
342      MALE
343    FEMALE
Name: Sex, Length: 344, dtype: object

Value Counts

The value_counts method returns a summary series.

penguins['Island'].value_counts()

Island
Biscoe       168
Dream        124
Torgersen     52
Name: count, dtype: int64

penguins['Species'].value_counts()

Species
Adelie Penguin (Pygoscelis adeliae)          152
Gentoo penguin (Pygoscelis papua)            124
Chinstrap penguin (Pygoscelis antarctica)     68
Name: count, dtype: int64

Selecting a subset of columns

simpler = penguins[['Species', 'Body Mass (g)', 'Flipper Length (mm)']]
simpler.head()

	Species	Body Mass (g)	Flipper Length (mm)
0	Adelie Penguin (Pygoscelis adeliae)	3750.0	181.0
1	Adelie Penguin (Pygoscelis adeliae)	3800.0	186.0
2	Adelie Penguin (Pygoscelis adeliae)	3250.0	195.0
3	Adelie Penguin (Pygoscelis adeliae)	NaN	NaN
4	Adelie Penguin (Pygoscelis adeliae)	3450.0	193.0

Index

A dataframe has an index, which can be just the numbers from 0 to N as in this case.

penguins.index

RangeIndex(start=0, stop=344, step=1)

Columns and Rows

loc allows you to access individual elements.

# The 23rd row
penguins.loc[23,:]

studyName                                          PAL0708
Sample Number                                           24
Species                Adelie Penguin (Pygoscelis adeliae)
Region                                              Anvers
Island                                              Biscoe
Stage                                   Adult, 1 Egg Stage
Individual ID                                        N12A2
Clutch Completion                                      Yes
Date Egg                                        2007-11-12
Culmen Length (mm)                                    38.2
Culmen Depth (mm)                                     18.1
Flipper Length (mm)                                  185.0
Body Mass (g)                                       3950.0
Sex                                                   MALE
Delta 15 N (o/oo)                                  8.43423
Delta 13 C (o/oo)                                -25.22664
Comments                                               NaN
Name: 23, dtype: object

penguins.loc[23,'Culmen Length (mm)']

38.2

penguins.loc[23:28,['Sex','Date Egg']]

	Sex	Date Egg
23	MALE	2007-11-12
24	MALE	2007-11-10
25	FEMALE	2007-11-10
26	MALE	2007-11-12
27	FEMALE	2007-11-12
28	FEMALE	2007-11-10

Filtering

Filtering is done by using a boolean series as an index.

penguins['Sex']=='FEMALE'

0      False
1       True
2       True
3      False
4       True
       ...  
339    False
340     True
341    False
342    False
343     True
Name: Sex, Length: 344, dtype: bool

females = penguins[penguins['Sex']=='FEMALE']
females.head()

	studyName	Sample Number	Species	Region	Island	Stage	Individual ID	Clutch Completion	Date Egg	Culmen Length (mm)	Culmen Depth (mm)	Flipper Length (mm)	Body Mass (g)	Sex	Delta 15 N (o/oo)	Delta 13 C (o/oo)	Comments
1	PAL0708	2	Adelie Penguin (Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N1A2	Yes	2007-11-11	39.5	17.4	186.0	3800.0	FEMALE	8.94956	-24.69454	NaN
2	PAL0708	3	Adelie Penguin (Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N2A1	Yes	2007-11-16	40.3	18.0	195.0	3250.0	FEMALE	8.36821	-25.33302	NaN
4	PAL0708	5	Adelie Penguin (Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N3A1	Yes	2007-11-16	36.7	19.3	193.0	3450.0	FEMALE	8.76651	-25.32426	NaN
6	PAL0708	7	Adelie Penguin (Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N4A1	No	2007-11-15	38.9	17.8	181.0	3625.0	FEMALE	9.18718	-25.21799	Nest never observed with full clutch.
12	PAL0708	13	Adelie Penguin (Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N7A1	Yes	2007-11-15	41.1	17.6	182.0	3200.0	FEMALE	NaN	NaN	Not enough blood for isotopes.

An alternative syntax is to use query. The quoting rules here can be tricky. The query is a string, and column names are set off by backticks. Using two different types of quotes allows the query to include a string.

females = penguins.query("`Sex`=='FEMALE'")
females.head()

	studyName	Sample Number	Species	Region	Island	Stage	Individual ID	Clutch Completion	Date Egg	Culmen Length (mm)	Culmen Depth (mm)	Flipper Length (mm)	Body Mass (g)	Sex	Delta 15 N (o/oo)	Delta 13 C (o/oo)	Comments
1	PAL0708	2	Adelie Penguin (Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N1A2	Yes	2007-11-11	39.5	17.4	186.0	3800.0	FEMALE	8.94956	-24.69454	NaN
2	PAL0708	3	Adelie Penguin (Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N2A1	Yes	2007-11-16	40.3	18.0	195.0	3250.0	FEMALE	8.36821	-25.33302	NaN
4	PAL0708	5	Adelie Penguin (Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N3A1	Yes	2007-11-16	36.7	19.3	193.0	3450.0	FEMALE	8.76651	-25.32426	NaN
6	PAL0708	7	Adelie Penguin (Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N4A1	No	2007-11-15	38.9	17.8	181.0	3625.0	FEMALE	9.18718	-25.21799	Nest never observed with full clutch.
12	PAL0708	13	Adelie Penguin (Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N7A1	Yes	2007-11-15	41.1	17.6	182.0	3200.0	FEMALE	NaN	NaN	Not enough blood for isotopes.

Fancier filtering

penguins[penguins["Flipper Length (mm)"]>penguins["Body Mass (g)"]/20]

	studyName	Sample Number	Species	Region	Island	Stage	Individual ID	Clutch Completion	Date Egg	Culmen Length (mm)	Culmen Depth (mm)	Flipper Length (mm)	Body Mass (g)	Sex	Delta 15 N (o/oo)	Delta 13 C (o/oo)	Comments
2	PAL0708	3	Adelie Penguin (Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N2A1	Yes	2007-11-16	40.3	18.0	195.0	3250.0	FEMALE	8.36821	-25.33302	NaN
4	PAL0708	5	Adelie Penguin (Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N3A1	Yes	2007-11-16	36.7	19.3	193.0	3450.0	FEMALE	8.76651	-25.32426	NaN
5	PAL0708	6	Adelie Penguin (Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N3A2	Yes	2007-11-16	39.3	20.6	190.0	3650.0	MALE	8.66496	-25.29805	NaN
8	PAL0708	9	Adelie Penguin (Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N5A1	Yes	2007-11-09	34.1	18.1	193.0	3475.0	NaN	NaN	NaN	No blood sample obtained.
10	PAL0708	11	Adelie Penguin (Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N6A1	Yes	2007-11-09	37.8	17.1	186.0	3300.0	NaN	8.63243	-25.21315	No blood sample obtained for sexing.
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
339	PAL0910	64	Chinstrap penguin (Pygoscelis antarctica)	Anvers	Dream	Adult, 1 Egg Stage	N98A2	Yes	2009-11-19	55.8	19.8	207.0	4000.0	MALE	9.70465	-24.53494	NaN
340	PAL0910	65	Chinstrap penguin (Pygoscelis antarctica)	Anvers	Dream	Adult, 1 Egg Stage	N99A1	No	2009-11-21	43.5	18.1	202.0	3400.0	FEMALE	9.37608	-24.40753	Nest never observed with full clutch.
341	PAL0910	66	Chinstrap penguin (Pygoscelis antarctica)	Anvers	Dream	Adult, 1 Egg Stage	N99A2	No	2009-11-21	49.6	18.2	193.0	3775.0	MALE	9.46180	-24.70615	Nest never observed with full clutch.
342	PAL0910	67	Chinstrap penguin (Pygoscelis antarctica)	Anvers	Dream	Adult, 1 Egg Stage	N100A1	Yes	2009-11-21	50.8	19.0	210.0	4100.0	MALE	9.98044	-24.68741	NaN
343	PAL0910	68	Chinstrap penguin (Pygoscelis antarctica)	Anvers	Dream	Adult, 1 Egg Stage	N100A2	Yes	2009-11-21	50.2	18.7	198.0	3775.0	FEMALE	9.39305	-24.25255	NaN

147 rows × 17 columns

Missing values

Dealing with missing values is a central problem in data science. One way to identify how many misssing values are out there is as follows:

## Uses the fact that logical True counts as one, False as zero
## sum() method sums by columns
penguins.isna().sum()

studyName                0
Sample Number            0
Species                  0
Region                   0
Island                   0
Stage                    0
Individual ID            0
Clutch Completion        0
Date Egg                 0
Culmen Length (mm)       2
Culmen Depth (mm)        2
Flipper Length (mm)      2
Body Mass (g)            2
Sex                     11
Delta 15 N (o/oo)       14
Delta 13 C (o/oo)       13
Comments               290
dtype: int64

Nearly all of the comments are empty. What are they?

comments = penguins['Comments'].value_counts()
comments

Comments
Nest never observed with full clutch.                                   34
Not enough blood for isotopes.                                           7
Sexing primers did not amplify.                                          4
No blood sample obtained.                                                2
No blood sample obtained for sexing.                                     2
Adult not sampled.                                                       1
Nest never observed with full clutch. Not enough blood for isotopes.     1
Sexing primers did not amplify. Not enough blood for isotopes.           1
Adult not sampled. Nest never observed with full clutch.                 1
No delta15N data received from lab.                                      1
Name: count, dtype: int64

Let’s save the comments separately and look at the rest.

# drop normally drops rows, but with axis=1 it drops columns
penguins = penguins.drop('Comments',axis=1)

Various options: - drop rows with missing values - impute the missing values somehow

Drop rows with missing values

# This makes a boolean where a row is True provided at least one of its entries are NA
na_rows = (penguins.isna().any(axis=1))
print(f"{na_rows.sum()} rows have NA somewhere outside of comments")

20 rows have NA somewhere outside of comments

# here we keep rows only if no NA's.  Can also use notna().
penguins_nona = penguins.loc[~na_rows,:]

Imputation

We saw above that culmen length has 2 missing values. We can use fillna to replace the missing values with something (like the mean or median or zero).

# using equality w/o copy creates another reference.
penguins_imputed = penguins.copy()
culmen_mean = penguins_imputed['Culmen Length (mm)'].mean() # how does this handle NA values?
print(f"Culmen length mean is {culmen_mean}")
penguins_imputed['Culmen Length (mm)'] = penguins_imputed['Culmen Length (mm)'].fillna(culmen_mean)

Culmen length mean is 43.9219298245614

There are many other imputation methods. For example, if the data is ordered, you can fill missing data with linear interpolation. (See the interpolate method).

Data types

As we saw above, the types of the columns are inferred when the data is read. But it’s not always correct. For example, the “Date Egg” column is supposed to be a date but it’s shown as a generic python object.

Using the correct type can greatly improve performance as generic Python arguments are inefficient.

In pandas 1.0 strings are always treated as objects but in pandas 2.0 there is a StringDtype.

The most common types are: - object - float64 - datetime (datetime64[ns]) - int64 - bool

One may also find categorical types.

penguins.dtypes

studyName               object
Sample Number            int64
Species                 object
Region                  object
Island                  object
Stage                   object
Individual ID           object
Clutch Completion       object
Date Egg                object
Culmen Length (mm)     float64
Culmen Depth (mm)      float64
Flipper Length (mm)    float64
Body Mass (g)          float64
Sex                     object
Delta 15 N (o/oo)      float64
Delta 13 C (o/oo)      float64
dtype: object

Setting datatypes

Here’s an example where we manually make sex a categorical type.

penguins = penguins.astype({'Sex':'category'})
penguins['Sex']

0        MALE
1      FEMALE
2      FEMALE
3         NaN
4      FEMALE
        ...  
339      MALE
340    FEMALE
341      MALE
342      MALE
343    FEMALE
Name: Sex, Length: 344, dtype: category
Categories (2, object): ['FEMALE', 'MALE']

One can also pass a dictionary setting the types of columns as an argument when you read them from the csv file.

Creating new columns

Simplifying the species name.

def first_word(x):
    return x.split()[0]
penguins['SimpleSpecies'] = penguins['Species'].map(first_word)

Rewriting the body mass in kilograms.

penguins['Body Mass (kg)'] = penguins['Body Mass (g)']/1000

Sorting

penguins_small = penguins[['Species','Island','Body Mass (g)']]
penguins_small.sort_values('Body Mass (g)')
# ascending = False for descending order
# na_position = 'first' or 'last' (default is 'last')
# can also provide a key which is a function of prototype Series -> Series
# inplace = True doesn't return a new dataframe, sorts the given one in place

	Species	Island	Body Mass (g)
314	Chinstrap penguin (Pygoscelis antarctica)	Dream	2700.0
64	Adelie Penguin (Pygoscelis adeliae)	Biscoe	2850.0
58	Adelie Penguin (Pygoscelis adeliae)	Biscoe	2850.0
116	Adelie Penguin (Pygoscelis adeliae)	Torgersen	2900.0
98	Adelie Penguin (Pygoscelis adeliae)	Dream	2900.0
...	...	...	...
269	Gentoo penguin (Pygoscelis papua)	Biscoe	6000.0
185	Gentoo penguin (Pygoscelis papua)	Biscoe	6050.0
169	Gentoo penguin (Pygoscelis papua)	Biscoe	6300.0
3	Adelie Penguin (Pygoscelis adeliae)	Torgersen	NaN
271	Gentoo penguin (Pygoscelis papua)	Biscoe	NaN

344 rows × 3 columns

Grouping

Grouping is a powerful tool. Let’s first group our penguins by species. The result is a “grouped” object which needs to pass through a summarize operation to be useful.

penguins_by_species = penguins.groupby('Species')
penguins_by_species

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7aca5c487250>

Summarizing

# describe computes basic descriptive statistics
penguins_by_species['Body Mass (g)'].describe()

	count	mean	std	min	25%	50%	75%	max
Species
Adelie Penguin (Pygoscelis adeliae)	151.0	3700.662252	458.566126	2850.0	3350.0	3700.0	4000.0	4775.0
Chinstrap penguin (Pygoscelis antarctica)	68.0	3733.088235	384.335081	2700.0	3487.5	3700.0	3950.0	4800.0
Gentoo penguin (Pygoscelis papua)	123.0	5076.016260	504.116237	3950.0	4700.0	5000.0	5500.0	6300.0

# this fails because some columns aren't numeric
penguins_by_species.mean(numeric_only=True)

	Sample Number	Culmen Length (mm)	Culmen Depth (mm)	Flipper Length (mm)	Body Mass (g)	Delta 15 N (o/oo)	Delta 13 C (o/oo)	Body Mass (kg)
Species
Adelie Penguin (Pygoscelis adeliae)	76.5	38.791391	18.346358	189.953642	3700.662252	8.859733	-25.804194	3.700662
Chinstrap penguin (Pygoscelis antarctica)	34.5	48.833824	18.420588	195.823529	3733.088235	9.356155	-24.546542	3.733088
Gentoo penguin (Pygoscelis papua)	62.5	47.504878	14.982114	217.186992	5076.016260	8.245338	-26.185298	5.076016

penguins_by_species.count()

	studyName	Sample Number	Region	Island	Stage	Individual ID	Clutch Completion	Date Egg	Culmen Length (mm)	Culmen Depth (mm)	Flipper Length (mm)	Body Mass (g)	Sex	Delta 15 N (o/oo)	Delta 13 C (o/oo)	SimpleSpecies	Body Mass (kg)
Species
Adelie Penguin (Pygoscelis adeliae)	152	152	152	152	152	152	152	152	151	151	151	151	146	141	141	152	151
Chinstrap penguin (Pygoscelis antarctica)	68	68	68	68	68	68	68	68	68	68	68	68	68	67	68	68	68
Gentoo penguin (Pygoscelis papua)	124	124	124	124	124	124	124	124	123	123	123	123	119	122	122	124	123

MultiIndex

penguins_by_sex_and_species = penguins.groupby(['Sex','Species'])
penguins_by_sex_and_species['Body Mass (g)'].describe().round()

/tmp/ipykernel_61635/3464751443.py:1: FutureWarning:

The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.

		count	mean	std	min	25%	50%	75%	max
Sex	Species
FEMALE	Adelie Penguin (Pygoscelis adeliae)	73.0	3369.0	269.0	2850.0	3175.0	3400.0	3550.0	3900.0
	Chinstrap penguin (Pygoscelis antarctica)	34.0	3527.0	285.0	2700.0	3362.0	3550.0	3694.0	4150.0
	Gentoo penguin (Pygoscelis papua)	58.0	4680.0	282.0	3950.0	4462.0	4700.0	4875.0	5200.0
MALE	Adelie Penguin (Pygoscelis adeliae)	73.0	4043.0	347.0	3325.0	3800.0	4000.0	4300.0	4775.0
	Chinstrap penguin (Pygoscelis antarctica)	34.0	3939.0	362.0	3250.0	3731.0	3950.0	4100.0	4800.0
	Gentoo penguin (Pygoscelis papua)	61.0	5485.0	313.0	4750.0	5300.0	5500.0	5700.0	6300.0

# pivot tables
penguins_by_sex_and_species['Body Mass (g)'].mean().reset_index().pivot(index='Sex',columns='Species',values='Body Mass (g)')

Species	Adelie Penguin (Pygoscelis adeliae)	Chinstrap penguin (Pygoscelis antarctica)	Gentoo penguin (Pygoscelis papua)
Sex
FEMALE	3368.835616	3527.205882	4679.741379
MALE	4043.493151	3938.970588	5484.836066

Pandas plotting

Some simple plots are available directly from pandas.

penguins[penguins['Species'].str.startswith("Adel")].groupby(['Sex'])['Body Mass (g)'].hist(bins=30,legend=True)

/tmp/ipykernel_61635/3368902341.py:1: FutureWarning:

The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.

Sex
FEMALE    Axes(0.125,0.11;0.775x0.77)
MALE      Axes(0.125,0.11;0.775x0.77)
Name: Body Mass (g), dtype: object

Excel files

We can read an excel file. This particular one is complicated for various reasons, including the fact that the column heads are in the third row, not at the top. Also there are a bunch of footnotes starting in row 510 that we don’t want. So we don’t read them in.

crime2019 = pd.read_excel("data/Violent Crime-by state-2019-table-5.xls",header=3,nrows=510)
crime2019

	State	Area	Unnamed: 2	Population	Violent \ncrime1	Murder and \nnonnegligent \nmanslaughter	Rape2	Robbery	Aggravated \nassault	Property \ncrime	Burglary	Larceny-theft	Motor \nvehicle \ntheft
0	ALABAMA	Metropolitan Statistical Area	NaN	3728978.000	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	NaN	NaN	Area actually reporting	0.766	12880	182.0	1141.0	1706.0	9851	65789	12388.0	47299.0	6102.0
2	NaN	NaN	Estimated total	1.000	19951	300.0	1542.0	3432.0	14677	104658	20728.0	73857.0	10073.0
3	NaN	Cities outside metropolitan areas	NaN	528518.000	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	NaN	NaN	Area actually reporting	0.893	3327	36.0	297.0	266.0	2728	17915	3140.0	13382.0	1393.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...
505	NaN	Nonmetropolitan counties	NaN	160615.000	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
506	NaN	NaN	Area actually reporting	0.921	194	0.0	39.0	0.0	155	973	170.0	716.0	87.0
507	NaN	NaN	Estimated total	1.000	213	0.0	42.0	0.0	171	1065	188.0	781.0	96.0
508	NaN	State Total	NaN	578759.000	1258	13.0	324.0	67.0	854	9093	1396.0	6984.0	713.0
509	NaN	NaN	Rate per 100,000 inhabitants	NaN	217.4	2.2	56.0	11.6	147.6	1571.1	241.2	1206.7	123.2

510 rows × 13 columns

The column names have newlines in them and we’d like to get rid of those.

crime2019.columns = [x.replace(" \n","_") for x in crime2019.columns]
crime2019

	State	Area	Unnamed: 2	Population	Violent_crime1	Murder and_nonnegligent_manslaughter	Rape2	Robbery	Aggravated_assault	Property_crime	Burglary	Larceny-theft	Motor_vehicle_theft
0	ALABAMA	Metropolitan Statistical Area	NaN	3728978.000	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	NaN	NaN	Area actually reporting	0.766	12880	182.0	1141.0	1706.0	9851	65789	12388.0	47299.0	6102.0
2	NaN	NaN	Estimated total	1.000	19951	300.0	1542.0	3432.0	14677	104658	20728.0	73857.0	10073.0
3	NaN	Cities outside metropolitan areas	NaN	528518.000	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	NaN	NaN	Area actually reporting	0.893	3327	36.0	297.0	266.0	2728	17915	3140.0	13382.0	1393.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...
505	NaN	Nonmetropolitan counties	NaN	160615.000	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
506	NaN	NaN	Area actually reporting	0.921	194	0.0	39.0	0.0	155	973	170.0	716.0	87.0
507	NaN	NaN	Estimated total	1.000	213	0.0	42.0	0.0	171	1065	188.0	781.0	96.0
508	NaN	State Total	NaN	578759.000	1258	13.0	324.0	67.0	854	9093	1396.0	6984.0	713.0
509	NaN	NaN	Rate per 100,000 inhabitants	NaN	217.4	2.2	56.0	11.6	147.6	1571.1	241.2	1206.7	123.2

510 rows × 13 columns

Let’s look at the states.

states = crime2019["State"].dropna().values
states

array(['ALABAMA', 'ALASKA', 'ARIZONA', 'ARKANSAS', 'CALIFORNIA',
       'COLORADO', 'CONNECTICUT', 'DELAWARE', 'DISTRICT OF COLUMBIA3',
       'FLORIDA', 'GEORGIA', 'HAWAII', 'IDAHO', 'ILLINOIS', 'INDIANA',
       'IOWA', 'KANSAS', 'KENTUCKY', 'LOUISIANA', 'MAINE', 'MARYLAND',
       'MASSACHUSETTS', 'MICHIGAN', 'MINNESOTA', 'MISSISSIPPI4',
       'MISSOURI', 'MONTANA', 'NEBRASKA', 'NEVADA', 'NEW HAMPSHIRE',
       'NEW JERSEY', 'NEW MEXICO', 'NEW YORK', 'NORTH CAROLINA',
       'NORTH DAKOTA', 'OHIO4', 'OKLAHOMA', 'OREGON4', 'PENNSYLVANIA',
       'PUERTO RICO', 'RHODE ISLAND', 'SOUTH CAROLINA', 'SOUTH DAKOTA',
       'TENNESSEE', 'TEXAS', 'UTAH', 'VERMONT', 'VIRGINIA', 'WASHINGTON',
       'WEST VIRGINIA', 'WISCONSIN', 'WYOMING'], dtype=object)

Some have footnotes at the end. We don’t want them.

states = [re.sub("[0-9$]","",x) for x in states]
states

['ALABAMA',
 'ALASKA',
 'ARIZONA',
 'ARKANSAS',
 'CALIFORNIA',
 'COLORADO',
 'CONNECTICUT',
 'DELAWARE',
 'DISTRICT OF COLUMBIA',
 'FLORIDA',
 'GEORGIA',
 'HAWAII',
 'IDAHO',
 'ILLINOIS',
 'INDIANA',
 'IOWA',
 'KANSAS',
 'KENTUCKY',
 'LOUISIANA',
 'MAINE',
 'MARYLAND',
 'MASSACHUSETTS',
 'MICHIGAN',
 'MINNESOTA',
 'MISSISSIPPI',
 'MISSOURI',
 'MONTANA',
 'NEBRASKA',
 'NEVADA',
 'NEW HAMPSHIRE',
 'NEW JERSEY',
 'NEW MEXICO',
 'NEW YORK',
 'NORTH CAROLINA',
 'NORTH DAKOTA',
 'OHIO',
 'OKLAHOMA',
 'OREGON',
 'PENNSYLVANIA',
 'PUERTO RICO',
 'RHODE ISLAND',
 'SOUTH CAROLINA',
 'SOUTH DAKOTA',
 'TENNESSEE',
 'TEXAS',
 'UTAH',
 'VERMONT',
 'VIRGINIA',
 'WASHINGTON',
 'WEST VIRGINIA',
 'WISCONSIN',
 'WYOMING']

] We don’t want to include DC or Puerto Rico.

states = [x for x in states if x!='DISTRICT OF COLUMBIA' and x!='PUERTO RICO']
states

['ALABAMA',
 'ALASKA',
 'ARIZONA',
 'ARKANSAS',
 'CALIFORNIA',
 'COLORADO',
 'CONNECTICUT',
 'DELAWARE',
 'FLORIDA',
 'GEORGIA',
 'HAWAII',
 'IDAHO',
 'ILLINOIS',
 'INDIANA',
 'IOWA',
 'KANSAS',
 'KENTUCKY',
 'LOUISIANA',
 'MAINE',
 'MARYLAND',
 'MASSACHUSETTS',
 'MICHIGAN',
 'MINNESOTA',
 'MISSISSIPPI',
 'MISSOURI',
 'MONTANA',
 'NEBRASKA',
 'NEVADA',
 'NEW HAMPSHIRE',
 'NEW JERSEY',
 'NEW MEXICO',
 'NEW YORK',
 'NORTH CAROLINA',
 'NORTH DAKOTA',
 'OHIO',
 'OKLAHOMA',
 'OREGON',
 'PENNSYLVANIA',
 'RHODE ISLAND',
 'SOUTH CAROLINA',
 'SOUTH DAKOTA',
 'TENNESSEE',
 'TEXAS',
 'UTAH',
 'VERMONT',
 'VIRGINIA',
 'WASHINGTON',
 'WEST VIRGINIA',
 'WISCONSIN',
 'WYOMING']

Finally, we want to pull out the violent crime numbers for the total area of the state. Notice that Puerto Rico and DC use “Total”, not “State Total”, for Area and so they will be excluded.

vcrime2019 = crime2019[crime2019['Area'] == 'State Total']['Violent_crime1'].values
vcrime2019

array([25046, 6343, 33141, 17643, 174331, 21938, 6546, 4115, 81270, 36170,
       4042, 4000, 51561, 24966, 8410, 11968, 9701, 25537, 1548, 27456,
       22578, 43686, 13332, 8272, 30380, 4328, 5821, 15210, 2074, 18375,
       17450, 69764, 38995, 2169, 34269, 17086, 11995, 39228, 2342, 26323,
       3530, 40647, 121474, 7553, 1262, 17753, 22377, 5674, 17070, 1258],
      dtype=object)