Variance and Principal Component Analysis

In [11]:
import pandas as pd
import numpy as np
from bokeh.plotting import figure
from import show, output_notebook
from bokeh.models import Slope, ColumnDataSource, Arrow, OpenHead, LabelSet
from bokeh.models.glyphs import MultiLine
Loading BokehJS ...

Preliminaries: Samples and Features

Our goal in machine learning problems is to extract meaningful information from a collection of data. In many cases the data is naturally organized into a large number of samples, and, for each sample, a large collection of measurements or features. Depending on whether we are doing classification or regression, we are trying to extract the label or the output value as a function of the features of a sample.

  • In the MNIST data set, each image is a sample and the features are the 784 pixel values for each image.
  • In the Zillow home price algorithm, the samples are houses, the features are the characteristics of the house, and the target to be predicted is the selling price.

In many cases the features are a mix of types -- some are numerical, some quantitative. For example, in the Zillow data, some relevant features might be past selling price and taxes, and others might be whether or not the house has Air Conditioning.

It's good practice to organize your data so that the ROWS are samples (or observations) and the COLUMNS are features (or variables). This is consistent with what's called tidy data. A great deal of software expects your data to be organized in this way.

Sometimes you have to make choices about what are your samples and what are your features. For example, in the movie recommendation problem you have a collection of data of the form (person, movie, rating). To use this data you need to "unbundle" it. You can:

  • create a matrix whose rows correspond to people and whose columns correspond to movies. From this point of view, an "observation" consists of finding a person and asking them their opinion about a bunch of movies.
  • create a matrix whose rows are movies and whose columns are people. From this point of view, an "observation" consists of collecting individual opinions from many people about that particular movie.

Which approach is correct depends on your goal. For us, if you want to recommend movies to people based on the opinions of other people with similar tastes, then your observations are people and your features are the movies that they like.

It seems to me that in the reinforcement learning problem, the features are the many states of the game, plus whose turn it is to move. The samples are the games themselves.

Mean and Variance

Let's look at a simple case where our features are numbers -- say there are $k$ samples and, for each sample, $m$ features that are real numbers. Then our data is a $k\times m$ matrix of real numbers or, alternatively, we have $k$ points $x_1,\ldots, x_k$ in $\mathbf{R}^{m}$.

From a geometric point of view, the mean of our dataset is the centroid of the set of points: $$ \overline{x} = \frac{1}{k}\sum_{i=1}^{k} x_k $$

The variance of our set of points, which measures how dispersed they are around the mean, is $$ \sigma^2 = \frac{1}{k}\sum_{i=1}^{k} \|x_i-\overline{x}\|^2 $$

Since our focus for the rest of this discussion is on the variance, let's simplify things and assume that our points have mean zero. This can be accomplished just be replacing each point $x_i$ by $x_i-\overline{x}$. Then the variance just becomes $$ \sigma^2=\frac{1}{k}\sum_{i=1}^{k}\|x_i\|^2. $$ or the average squared distance of the points from the origin.

Components of the variance

Each point $x_i$ is a vector, so it has $m$ coordinates: $$ x_i = (x_{i1},x_{i2},\ldots, x_{im}). $$

Therefore $$ \|x_i\|^2 = \sum_{j=1}^{m} x_{ij}^2. $$

We can regroup the sum that gives the variance by components and get $$ \sigma^2 = \sum_{j=1}^{m} \sigma_j^2 $$ where, for each $j$, $$ \sigma_{j}^2 = \sum_{i=1}^{k} x_{ij}^2. $$ In other words $\sigma_j^2$ is the variance of the $j^{th}$ coordinates of the points $x_i$ and the total variance is a sum of contributions of these variances in the $m$ standard orthogonal basis directions in $\mathbf{R}^{m}$.

More generally, suppose that $\mathbf{u}$ is a unit vector in $\mathbf{R}^{m}$. We can orthogonally project each of the points $x_i$ into the $\mathbf{u}$ direction and ask for the variance in the $\mathbf{u}$ direction.

In [12]:
x = np.random.normal(0,1.5,size=20)
y = np.random.normal(0,1.5,size=20)
xprime = (x+2*y)/5
yprime = 2*xprime
xs = [[x[i],xprime[i]] for i in range(x.shape[0])]
ys = [[y[i],yprime[i]] for i in range(y.shape[0])]
In [13]:
f = figure(x_range=(-5,5),y_range=(-5,5),title="Orthogonal Projection of Data"),y,legend='data'),yprime,color='black',legend='projection')

Since the projection of the data point $x$ in the $\mathbf{u}$ direction is $(x\cdot \mathbf{u})\mathbf{u}$ the variance in the $\mathbf{u}$ direction is $$ \sigma_{\mathbf{u}}^2 = \sum_{i=1}^{k}\|(x_i\cdot\mathbf{u})\mathbf{u}\|^2=\sum_{i=1}^{k}|(x_i\cdot\mathbf{u})|^2 $$

How should we interpret this? The "direction" given by the vector $\mathbf{u}$ is a "synthetic feature" of the data -- it's a feature that has been derived from the other features, in this case as a (normalized) linear combination of features.

Suppose for concreteness that $m=3$ and that the features of our data are age, weight, and income (normalized so that the means are zero). If our direction $\mathbf{u}$ is $(\frac{1}{\sqrt{3}},\frac{1}{\sqrt{3}},\frac{1}{\sqrt{3}})$ then projection of the data into the $\mathbf{u}$ direction measures the synthetic feature

$$ \frac{(\mathrm{age} + \mathrm{weight} + \mathrm{income})}{\sqrt{3}} $$

and the variance $\sigma^2_{\mathbf{u}}$ measures the variance of this synthetic feature.

We may be able to distinguish people better through such synthetic features than we can through the individual characteristics.

For example, your Credit Score is some kind of synthetic feature constructed out of all of your credit history which is supposed to predict your likelihood of loan repayment. Your BMI (your weight divided by your height squared) is another synthetic feature.

Of course, the $\sigma_u^2$ varies with $u$. We can package up all of the information for varying $u$ by using a little linear algebra. Our vectors $x_i$ are $1\times m$ row vectors, and so is the direction $\mathbf{u}$. Therefore the dot product $x\cdot \mathbf{u}$ can be written $x^t\mathbf{u}$ where $x^t$ is the transpose of $x$.

Putting all of the $x_i$ together into the $k\times m$ matrix $X$, it follows that

$$ \sigma_\mathbf{u}^2 = \mathbf{u}X^tX\mathbf{u}^t. $$

To pull this apart, $\mathbf{u}X^t$ is a $1\times k$ matrix whose entries are the dot products $\mathbf{u}\cdot x_i$ for $i=1,\ldots, k$. Taking the dot product of this $1\times k$ matrix with itself gives the sum of the squares of these dot products, which is what we want.

Note also that this sum of squares is non-negative. It could be zero -- for example, if all of the data happened to lie in a lower dimensional subspace of the features space -- but for random data this is pretty much impossible. It might lie close to a lower dimensional subspace, however.

The matrix $X^{t}X$ is an $m\times m$ symmetric matrix. If we let $D=X^{t}X/k$ then the $m$ diagonal entries of $D$ are the variances along the $m$ coordinate axes that we called $\sigma_j^2$ up above, and its trace is the total variance $\sigma^2$.

Eigenvectors, Eigenvalues, and Variance

Because the matrix $D=X^tX/k$ is a real symmetric matrix we can construct an orthonormal sequence $u_1,\ldots,u_m$ of eigenvectors of $D$ with associated eigenvalues $\lambda_1,\lambda_2,\ldots,\lambda_m$.

Let's order these so that

$$ \lambda_1\ge \lambda_2 \ge \cdots\ge \lambda_m\ge 0. $$

These $u_i$ are called the principal directions. Recall that

$$ \sigma_{u_i}^2 = u_iDu_i^t = \lambda_i $$

so the eigenvalues (the principal components) are the variances in the principal directions.

Proposition: $$ \sigma^2 = \sum_{i=1}^{m} \lambda_i $$

Proof: The total variance $\sigma^2$ is the trace of the matrix $D$.

Corollary: The direction $u_1$ is the direction in which the data has the largest variance (in the sense that the orthogonal projection of the data into the $u_1$ direction has the largest variance). Similarly $u_2$ is the second largest, and so on.

Dimension Reduction

The idea behind using the principal components to reduce the dimension of the data is that if we project our datapoints into the subspace spanned by the first few principal components (meaning the ones associated with largest eigenvalues, and hence the largest share of the variance) then we will capture a significant portion of the variation in our data. The smaller eigenvalues contain less information or might even arise from noise.

Here we carry out analysis of some (very old) data on arrests in the USA. This dataset is distributed with the R language package and it can be downloaded from here. For each state it gives the rates of murder, rape and assault, as well as the percentage of population that lives in urban areas. Note this is from 1975 so don't draw any real conclusions from it!

This analysis follows section 10.2.1 of Introduction to Statistical Learning with R by James, et. al, called ISLR below.

First we load in the data.

In [14]:
arrests = pd.read_csv('data/R/USArrests.csv',index_col=0)

We standardize it so it has mean zero and each column (feature) has standard deviation 1. This latter step mainly helps scale the final results so they are easier to compare.

In [15]:
M = arrests.values
M = (M - np.mean(M,axis=0))
M = M/M.std(axis=0)

We compute the symmetric matrix (called the covariance matrix) and then find it's eigenvalues (L) and eigenvectors (the columns of the matrix V)

In [16]:
D =,M)/M.shape[0]
L, V = np.linalg.eigh(D)

We will focus our attention on the space spanned by the last two columns of V which correspond to the two largest eigenvalues -- these are the "principal components". We project the data into the space spanned by those vectors.

In [17]:
P =,V[:,2:])

We add the coordinates of the PC's into the dataframe.

In [18]:
arrests['PC1'] = P[:,1]
arrests['PC2'] = P[:,0]
arrests['State'] = arrests.index

Each of the original variables corresponds to a standard basis vector in the original feature space. We project those original variables into the space spanned by PC1 and PC2 as well, and standardize them. This will allow us to see how moving around in the PC space is related to changes in the original variables.

In [19]:
murder = V[0,2:]/np.linalg.norm(V[0,2:])
assault = V[1,2:]/np.linalg.norm(V[1,2:])
rape = V[3,2:]/np.linalg.norm(V[3,2:])
pop = V[2,2:]/np.linalg.norm(V[2,2:])
In [20]:
f = figure(tooltips=[("State","@index"),("Murder","@Murder"),("Assault","@Assault"),("Rape","@Rape"),("UrbanPct","@UrbanPop")],y_range=(-3,3),title='PCA on Arrest Data:orange=rape, blue=assault, green=murder'),x='PC1',y='PC2')

In the diagram above (which mimics Figure 10.1 in ISLR), we see that the red arrow points in the direction in which the original "UrbanPop" is projected. So moving DOWN in the graph corresponds to increasing percentage of urban population. Rural states like Mississippi, Vermont, and West Virginia are up above; Urban states like California, New Jersey, Massachusetts are down below.

The three crime arrows point to the left, with slighly different orientations. This tells us that moving to the left gives generally higher crime rates, though the different crimes interact differently with population.

So we can think of PC1 as a good amalgamation of the overall crime measure and PC2 as measuring urban population.

In [ ]: