For those interested in machine learning, the MNIST database of handwritten digits is the most common starting point for classification. The data can be obtained from many places. I downloaded the data files from the Kaggle digit-recognizer "competition". This yields a train.csv file and a test.csv file, as well as a sample_submission.csv file that we won't use.
We'll work with train.csv, a csv file whose rows correspond to labelled images. Its first column, "label", gives the true digit shown in the image. The remaining 784 columns, named pixel0 through pixel783, hold the grayscale intensities (integers from 0 to 255) of a 28x28 image, stored one image row after another.
import pandas as pd
import matplotlib.pyplot as plt
mnist = pd.read_csv('../data/MNIST/train.csv')
print('We have {} images'.format(mnist.shape[0]))
Here are the columns.
mnist.columns
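Given that layout (a label column followed by 784 pixel columns), it can be convenient to separate the labels from the pixels and reshape each row into a 28x28 array. The following is a minimal sketch, not part of the original notebook: the helper name split_labels_pixels is hypothetical, and it assumes the table has a column called "label" and that every other column is a pixel column in row-by-row order.

```python
import numpy as np
import pandas as pd


def split_labels_pixels(df):
    """Split a Kaggle-style MNIST table into labels and 28x28 images.

    Assumption: df has a 'label' column and exactly 784 other columns
    holding the pixel intensities in row-by-row order.
    """
    labels = df['label'].to_numpy()
    pixels = df.drop(columns='label').to_numpy()
    if pixels.shape[1] != 28 * 28:
        raise ValueError(
            'expected 784 pixel columns, got {}'.format(pixels.shape[1]))
    # One 28x28 array per table row.
    images = pixels.reshape(-1, 28, 28)
    return labels, images
```

With the table loaded as above, labels, images = split_labels_pixels(mnist) would give one label and one 28x28 array per image, and images.min() and images.max() are a quick way to check the range of the pixel values.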
This is a little function to draw selections out of the mnist table, just to see how they look.
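def show_row(L, ncols=3, size=20):
    '''show_row(L,ncols=3,size=20): Display the requested rows (in iterable L) of the mnist dataframe as images,
    using ncols columns, and figsize size.'''
    L = list(L)  # accept any iterable, e.g. a range
    nrows = max(1, -(-len(L) // ncols))  # ceiling division: just enough rows
    # squeeze=False keeps axes two-dimensional even when there is only one row
    fig, axes = plt.subplots(nrows, ncols, squeeze=False)
    fig.set_size_inches(size, size)
    for s, i in enumerate(L):
        # place each image by its position s in L, not by its row number i
        ax = axes[s // ncols, s % ncols]
        ax.imshow(mnist.iloc[i, 1:].values.reshape(28, 28), cmap='gray_r')
        ax.set_title('This is a/n {}'.format(mnist.iloc[i, 0]))
    # remove the unused panels at the end of the grid
    for j in range(len(L), nrows * ncols):
        fig.delaxes(axes[j // ncols, j % ncols])
    fig.tight_layout(pad=1.4)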
And here are the first 25 images out of the table.
show_row(range(25),ncols=5)