1 Principal Component Analysis

1.1 Introduction

Suppose that, as usual, we begin with a collection of measurements of different features for a group of samples. Some of these measurements will tell us quite a bit about the difference among our samples, while others may contain relatively little information. For example, if we are analyzing the effect of a certain weight loss regimen on a group of people, the age and weight of the subjects may have a great deal of influence on how successful the regimen is, while their blood pressure might not. One way to help identify which features are more significant is to ask whether or not the feature varies a lot among the different samples. If nearly all the measurements of a feature are the same, it can’t have much power in distinguishing the samples, while if the measurements vary a great deal then that feature has a chance to contain useful information.

In this section we will discuss a way to measure the variability of measurements and then introduce principal component analysis (PCA). PCA is a method for finding which linear combinations of measurements have the greatest variability and therefore might contain the most information. It also allows us to identify combinations of measurements that don’t vary much at all. Combining this information, we can sometimes replace our original system of features with a smaller set that still captures most of the interesting information in our data, and thereby find hidden characteristics of the data and simplify our analysis a great deal.

1.2 Variance and Covariance

1.2.1 Variance

Suppose that we have a collection of measurements $(x_1,\ldots,x_n)$ of a particular feature $X$. For example, $x_i$ might be the initial weight of the $i$th participant in our weight loss study. The mean of the values $(x_1,\ldots,x_n)$ is

$$\mu_X = \frac{1}{n}\sum_{i=1}^{n} x_i.$$

The simplest measure of the variability of the data is called its variance.

Definition: The (sample) variance of the data $x_1,\ldots,x_n$ is

$$\sigma_X^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\mu_X)^2 = \frac{1}{n}\left(\sum_{i=1}^{n}x_i^2\right) - \mu_X^2 \qquad (1)$$

The square root of the variance is called the standard deviation.

As we see from the formula, the variance is a measure of how ‘spread out’ the data is from the mean.

Recall that in our discussion of linear regression we thought of our set of measurements $x_1,\ldots,x_n$ as a vector – it's one of the columns of our data matrix. From that point of view, the variance has a geometric interpretation – it is $\frac{1}{n}$ times the square of the distance from the point $X=(x_1,\ldots,x_n)$ to the point $\mu_X(1,1,\ldots,1)=\mu_X E$:

$$\sigma_X^2 = \frac{1}{n}(X-\mu_X E)\cdot(X-\mu_X E) = \frac{1}{n}\|X-\mu_X E\|^2. \qquad (2)$$
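As a quick numerical check (not part of the original text's tooling), here is a short Python/NumPy sketch that computes the variance of a small, made-up set of measurements three ways: from the two expressions in eq. 1 and from the geometric formula in eq. 2.

```python
import numpy as np

# Made-up measurements of a single feature X
x = np.array([61.0, 72.5, 58.0, 80.0, 69.5])
n = len(x)
mu = x.mean()                           # the mean mu_X

var_def = ((x - mu) ** 2).sum() / n     # first form in eq. 1 (1/n convention)
var_alt = (x ** 2).sum() / n - mu ** 2  # second form in eq. 1

E = np.ones(n)                          # the vector E = (1, 1, ..., 1)
var_geom = np.linalg.norm(x - mu * E) ** 2 / n   # geometric form, eq. 2

print(var_def, var_alt, var_geom)       # all three values agree
```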

1.2.2 Covariance

The variance measures the dispersion of measures of a single feature. Often, we have measurements of multiple features and we might want to know something about how two features are related. The covariance is a measure of whether two features tend to be related, in the sense that when one increases, the other one increases; or when one increases, the other one decreases.

Definition: Given measurements $(x_1,\ldots,x_n)$ and $(y_1,\ldots,y_n)$ of two features $X$ and $Y$, the covariance of $X$ and $Y$ is

$$\sigma_{XY} = \frac{1}{N}\sum_{i=1}^{N}(x_i-\mu_X)(y_i-\mu_Y) \qquad (3)$$

There is a nice geometric interpretation of this, as well, in terms of the dot product. If $X=(x_1,\ldots,x_n)$ and $Y=(y_1,\ldots,y_n)$ then

$$\sigma_{XY} = \frac{1}{N}\left((X-\mu_X E)\cdot(Y-\mu_Y E)\right).$$

From this point of view, we can see that $\sigma_{XY}$ is positive if the vectors $X-\mu_X E$ and $Y-\mu_Y E$ “point roughly in the same direction” and it is negative if they “point roughly in the opposite direction.”

1.2.3 Correlation

One problem with interpreting the variance and covariance is that we don't have a scale – for example, if $\sigma_{XY}$ is large and positive, then we'd like to say that $X$ and $Y$ are closely related, but it could be just that the entries of $X-\mu_X E$ and $Y-\mu_Y E$ are large. Here, though, we can really take advantage of the geometric interpretation. Recall that the dot product of two vectors satisfies the formula

$$a\cdot b = \|a\|\|b\|\cos(\theta)$$

where $\theta$ is the angle between $a$ and $b$. So

$$\cos(\theta) = \frac{a\cdot b}{\|a\|\|b\|}.$$

Let's apply this to the variance and covariance, by noticing that

$$\frac{(X-\mu_X E)\cdot(Y-\mu_Y E)}{\|X-\mu_X E\|\,\|Y-\mu_Y E\|} = \frac{\sigma_{XY}}{\sqrt{\sigma_{XX}}\sqrt{\sigma_{YY}}}$$

so the quantity

$$r_{XY} = \frac{\sigma_{XY}}{\sigma_X\sigma_Y} \qquad (4)$$

measures the cosine of the angle between the vectors $X-\mu_X E$ and $Y-\mu_Y E$.

Definition: The quantity $r_{XY}$ defined in eq. 4 is called the (sample) correlation coefficient between $X$ and $Y$. We have $0\le |r_{XY}|\le 1$, with $r_{XY}=\pm 1$ if and only if the two vectors $X-\mu_X E$ and $Y-\mu_Y E$ are collinear in $\mathbb{R}^n$.

Figure 1 illustrates data with different values of the correlation coefficient.

Figure 1: Correlation
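The following short NumPy sketch (my own illustration, using made-up paired measurements) computes the covariance from eq. 3 and checks that the correlation coefficient of eq. 4 equals the cosine of the angle between the centered vectors.

```python
import numpy as np

# Made-up paired measurements of two features X and Y
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

xc = x - x.mean()            # X - mu_X * E
yc = y - y.mean()            # Y - mu_Y * E

cov_xy = xc.dot(yc) / n                  # covariance, eq. 3 (1/N convention)
sigma_x = np.sqrt(xc.dot(xc) / n)        # standard deviations
sigma_y = np.sqrt(yc.dot(yc) / n)
r_xy = cov_xy / (sigma_x * sigma_y)      # correlation coefficient, eq. 4

# r_xy is the cosine of the angle between the centered vectors
cos_theta = xc.dot(yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))
print(r_xy, cos_theta)                   # identical up to rounding
```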

1.2.4 The covariance matrix

In a typical situation we have many features for each of our (many) samples, that we organize into a data matrix X. To recall, each column of X corresponds to a feature that we measure, and each row corresponds to a sample. For example, each row of our matrix might correspond to a person enrolled in a study, and the columns correspond to height (cm), weight (kg), systolic blood pressure, and age (in years):

Table 1: A sample data matrix X
sample Ht Wgt Bp Age
A 180 75 110 35
B 193 80 130 40
U 150 92 105 55

If we have multiple features, as in this example, we might be interested in the variance of each feature and all of their mutual covariances. This “package” of information can be obtained “all at once” by taking advantage of some matrix algebra.

Definition: Let $X$ be an $N\times k$ data matrix, where the $k$ columns of $X$ correspond to different features and the $N$ rows to different samples. Let $X_0$ be the centered version of this data matrix, obtained by subtracting the mean $\mu_i$ of column $i$ from all the entries $x_{si}$ in that column. Then the $k\times k$ symmetric matrix

$$D_0 = \frac{1}{N}X_0^{\top}X_0$$

is called the (sample) covariance matrix for the data.

Proposition: The diagonal entries $d_{ii}$ of $D_0$ are the variances of the columns of $X$:

$$d_{ii} = \sigma_i^2 = \frac{1}{N}\sum_{s=1}^{N}(x_{si}-\mu_i)^2$$

and the off-diagonal entries $d_{ij}=d_{ji}$ are the covariances of the $i$th and $j$th columns of $X$:

$$d_{ij} = \sigma_{ij} = \frac{1}{N}\sum_{s=1}^{N}(x_{si}-\mu_i)(x_{sj}-\mu_j)$$

The sum of the diagonal entries, the trace of $D_0$, is the total variance of the data.

Proof: This follows from the definitions, but it’s worth checking the details, which we leave as an exercise.
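As an illustration (my own sketch, using randomly generated data rather than anything from the text), the following NumPy code builds the covariance matrix from a centered data matrix and checks the proposition numerically.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))          # hypothetical data: N = 6 samples, k = 3 features

N = X.shape[0]
X0 = X - X.mean(axis=0)              # center each column (feature)
D0 = X0.T @ X0 / N                   # covariance matrix, 1/N convention

# Diagonal entries are the per-column variances (np.var also uses 1/N by default)
print(np.allclose(np.diag(D0), X0.var(axis=0)))                  # True

# Off-diagonal entries are covariances, e.g. of columns 0 and 1
print(np.isclose(D0[0, 1], (X0[:, 0] * X0[:, 1]).mean()))        # True

# The trace is the total variance
print(np.isclose(np.trace(D0), X0.var(axis=0).sum()))            # True
```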

1.2.5 Visualizing the covariance matrix

If the number of features in the data is not too large, a density matrix plot provides a tool for visualizing the covariance matrix of the data. A density matrix plot is a $k\times k$ grid of plots (where $k$ is the number of features). The entry with $(i,j)$ coordinates in the grid is a scatter plot of the $i$th feature against the $j$th one if $i\neq j$, and is a histogram of the $i$th variable if $i=j$.

Figure 2 is an example of a density matrix plot for a dataset with 50 samples and 2 features. This data has been centered, so it can be represented in a $50\times 2$ data matrix $X_0$. The upper left and lower right graphs are the histograms of the two columns, while the lower left and upper right are scatter plots of the columns against each other.

Figure 2: Density Matrix Plot
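A plot along these lines can be produced with a few lines of matplotlib; the sketch below is my own, using synthetic correlated data (not the dataset behind fig. 2), and simply follows the definition above: histograms on the diagonal, scatter plots off the diagonal.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
# Synthetic centered data: 50 samples, 2 correlated features
X0 = rng.multivariate_normal([0, 0], [[1.0, 1.5], [1.5, 4.0]], size=50)

k = X0.shape[1]
fig, axes = plt.subplots(k, k, figsize=(6, 6))
for i in range(k):
    for j in range(k):
        ax = axes[i, j]
        if i == j:
            ax.hist(X0[:, i], bins=10)               # histogram on the diagonal
        else:
            ax.scatter(X0[:, j], X0[:, i], s=10)     # scatter plot off the diagonal
fig.tight_layout()
plt.show()
```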

1.2.6 Linear Combinations of Features (Scores)

Sometimes useful information about our data can be revealed if we combine different measurements together to obtain a “hybrid” measure that captures something interesting. For example, in the Auto MPG dataset that we studied in the section on Linear Regression, we looked at the influence of both vehicle weight $w$ and engine displacement $e$ on gas mileage; perhaps there is some value in considering a hybrid “score” defined as $S = aw + be$ for some constants $a$ and $b$ – maybe by choosing a good combination we could find a better predictor of gas mileage than either of the features individually.

As another example, suppose we are interested in the impact of the nutritional content of food on weight gain in a study. We know that both calorie content and the level of dietary fiber contribute to the weight gain of participants eating this particular food; maybe there is some kind of combined “calorie/fiber” score we could introduce that captures the impact of that food better.

Definition: Let $X_0$ be a (centered) $N\times k$ data matrix giving information about $k$ features for each of $N$ samples. A linear synthetic feature, or a linear score, is a linear combination of the $k$ features. The linear score is defined by constants $a_1,\ldots,a_k$: if $y_1,\ldots,y_k$ are the values of the features for a particular sample, then the linear score for that sample is

$$S = a_1y_1 + a_2y_2 + \cdots + a_ky_k$$

Lemma: The values of the linear score for each of the $N$ samples can be calculated as

$$\begin{bmatrix} S_1\\ \vdots\\ S_N\end{bmatrix} = X_0\begin{bmatrix} a_1\\ \vdots\\ a_k\end{bmatrix}. \qquad (5)$$

Proof: Multiplying a matrix by a column vector computes a linear combination of the columns – that’s what this lemma says. Exercise 3 asks you to write out the indices and make sure you believe this.
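A tiny NumPy sketch (mine, with arbitrary made-up numbers) makes the lemma concrete: the score computed sample by sample agrees with the single matrix-vector product of eq. 5.

```python
import numpy as np

rng = np.random.default_rng(2)
X0 = rng.normal(size=(5, 3))            # hypothetical centered data: N = 5, k = 3
a = np.array([0.5, -1.0, 2.0])          # weights a_1, ..., a_k of a linear score

# The score for each sample, computed one sample at a time ...
S_loop = np.array([sum(a[j] * X0[i, j] for j in range(3)) for i in range(5)])

# ... and all at once as a matrix-vector product (eq. 5)
S_mat = X0 @ a

print(np.allclose(S_loop, S_mat))       # True
```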

1.2.7 Mean and variance of scores

When we combine features to make a hybrid score, we assume that the features were centered to begin with, so that each feature has mean zero. As a result, the mean of the hybrid score is again zero.

Lemma: A linear combination of features with mean zero again has mean zero.

Proof: Let $S_i$ be the score for the $i$th sample, so $S_i = \sum_{j=1}^{k}x_{ij}a_j$, where $X_0$ has entries $x_{ij}$. Then the mean value of the score is $\mu_S = \frac{1}{N}\sum_{i=1}^{N}S_i = \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{k}x_{ij}a_j$. Reversing the order of the sums yields $$\mu_S = \frac{1}{N}\sum_{j=1}^{k}\sum_{i=1}^{N}x_{ij}a_j = \sum_{j=1}^{k}a_j\frac{1}{N}\left(\sum_{i=1}^{N}x_{ij}\right) = \sum_{j=1}^{k}a_j\mu_j = 0,$$ where $\mu_j = 0$ is the mean of the $j$th feature (column) of $X_0$.

The variance is more interesting, and gives us an opportunity to put the covariance matrix to work. Remember from eq. 2 that, since a score $S$ has mean zero, its variance is $\sigma_S^2 = \frac{1}{N}S\cdot S$ – where here the score $S$ is represented by the column vector with entries $S_1,\ldots,S_N$ as in eq. 5.

Lemma: The variance of the score $S$ with weights $a_1,\ldots,a_k$ is $$\sigma_S^2 = a^{\top}D_0a = \begin{bmatrix} a_1 & \cdots & a_k\end{bmatrix} D_0 \begin{bmatrix} a_1\\ \vdots\\ a_k\end{bmatrix} \qquad (6)$$ More generally, if $S_1$ and $S_2$ are scores with weights $a_1,\ldots,a_k$ and $b_1,\ldots,b_k$ respectively, then the covariance $\sigma_{S_1S_2}$ is $\sigma_{S_1S_2} = a^{\top}D_0b$.

Proof: From eq. 2 and eq. 5 we know that $\sigma_S^2 = \frac{1}{N}S\cdot S$ and $S = X_0a$. Since $S\cdot S = S^{\top}S$, this gives us $$\sigma_S^2 = \frac{1}{N}(X_0a)^{\top}(X_0a) = \frac{1}{N}a^{\top}X_0^{\top}X_0a = a^{\top}D_0a$$ as claimed.

For the covariance, use a similar argument with eq. 3 and eq. 5, writing $\sigma_{S_1S_2} = \frac{1}{N}S_1\cdot S_2$ and using the fact that $S_1$ and $S_2$ can be written as $X_0a$ and $X_0b$.

The point of this lemma is that the covariance matrix contains not just the variances and covariances of the original features, but also enough information to construct the variances and covariances for any linear combination of features.
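Here is a small numerical check of the lemma (my own sketch on random data): the variance and covariance of two scores computed directly from their values agree with $a^{\top}D_0a$ and $a^{\top}D_0b$.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))           # hypothetical data: N = 100, k = 4
N = X.shape[0]
X0 = X - X.mean(axis=0)                 # centered data
D0 = X0.T @ X0 / N                      # covariance matrix

a = np.array([1.0, 0.5, -0.5, 2.0])     # weights of two scores
b = np.array([0.0, 1.0, 1.0, -1.0])
S1, S2 = X0 @ a, X0 @ b

# Variance of S1 from its values vs. from the covariance matrix (eq. 6)
print(np.isclose(S1.var(), a @ D0 @ a))            # True

# Covariance of S1 and S2 from their values vs. a^T D0 b
print(np.isclose((S1 * S2).mean(), a @ D0 @ b))    # True
```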

In the next section we will see how to exploit this idea to reveal hidden structure in our data.

1.2.8 Geometry of Scores

Let's return to the dataset that we looked at in section 1.2.5. We simplify its density matrix plot in fig. 3, which shows one of the scatter plots and the two histograms.

The scatter plot shows that the data points are arranged in a more or less elliptical cloud oriented at an angle to the $xy$-axes, which represent the two given features. The two individual histograms show the distribution of the two features – each has mean zero, with the $x$-feature distributed between $-2$ and $2$ and the $y$-feature between $-4$ and $4$. Looking just at the two features individually, meaning only at the two histograms, we can't see the overall elliptical structure.

Figure 3: Simulated Data with Two Features

How can we get a better grip on our data in this situation? We can try to find a “direction” in our data that better illuminates the variation of the data. For example, suppose that we pick a unit vector at the origin pointing in a particular direction in our data. See fig. 4.

Figure 4: A direction in the data

Now we can orthogonally project the datapoints onto the line defined by this vector, as shown in fig. 5.

Figure 5: Projecting the datapoints

Recall that if the unit vector is defined by coordinates $u = [u_0, u_1]$, then the orthogonal projection of the point $x$ with coordinates $(x_0,x_1)$ onto the line through $u$ is $(x\cdot u)u$. Now $x\cdot u = u_0x_0 + u_1x_1$, so the coordinates of the points along the line defined by $u$ are the values of the score $Z$ defined by the weights $u = [u_0,u_1]$. Using our work in the previous section, we see that we can find all of these coordinates by matrix multiplication: $Z = X_0u$, where $X_0$ is our data matrix. Now let's add a histogram of the values of $Z$ to our picture:

Figure 6: Distribution of Z

This histogram shows the distribution of the values of Z along the tilted line defined by the unit vector u.

Finally, using our work on the covariance matrix, we see that the variance of $Z$ is given by $$\sigma_Z^2 = \frac{1}{50}u^{\top}X_0^{\top}X_0u = u^{\top}D_0u$$ where $D_0$ is the covariance matrix of the data $X_0$.

Lemma: Let $X_0$ be an $N\times k$ centered data matrix, and let $D_0 = \frac{1}{N}X_0^{\top}X_0$ be the associated covariance matrix. Let $u$ be a unit vector in “feature space” $\mathbb{R}^k$. Then the score $S = X_0u$ can be interpreted as the coordinates of the points of $X_0$ projected onto the line generated by $u$. The variance of this score is $$\sigma_S^2 = u^{\top}D_0u = \frac{1}{N}\sum_{i=1}^{N}s_i^2$$ where $s_i = X_0[i,:]\cdot u$ is the dot product of the $i$th row $X_0[i,:]$ with $u$. It measures the variability in the data “in the direction of the unit vector $u$.”

1.3 Principal Components

1.3.1 Change of variance with direction

As we've seen in the previous section, if we choose a unit vector $u$ in the feature space and find the projection $X_0u$ of our data onto the line through $u$, we get a “score” that we can use to measure the variance of the data in the direction of $u$. What happens as we vary $u$?

To study this question, let's continue with our simulated data from the previous section, and introduce a unit vector $$u(\theta) = \begin{bmatrix}\cos(\theta)\\ \sin(\theta)\end{bmatrix}.$$ This is in fact a unit vector, since $\sin^2(\theta)+\cos^2(\theta)=1$, and it is oriented at an angle $\theta$ from the $x$-axis.

The variance of the data in the direction of $u(\theta)$ is given by $$\sigma_{\theta}^2 = u(\theta)^{\top}D_0\,u(\theta).$$

A plot of this function for the data we have been considering is in fig. 7. As you can see, the variance goes through two full periods as the angle runs from $0$ to $2\pi$, and it reaches its maximum and minimum values at intervals of $\pi/2$ – so the two directions where the variance is maximum and minimum are orthogonal to one another.

Figure 7: Change of variance with angle theta
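This behavior is easy to reproduce numerically. The sketch below (my own, using synthetic two-feature data rather than the dataset plotted in the figures) evaluates $\sigma_{\theta}^2 = u(\theta)^{\top}D_0u(\theta)$ over a grid of angles and confirms that the maximizing and minimizing directions sit a quarter turn apart.

```python
import numpy as np

rng = np.random.default_rng(4)
# Synthetic two-feature data shaped like a tilted elliptical cloud
X0 = rng.multivariate_normal([0, 0], [[2.0, 1.2], [1.2, 1.0]], size=500)
X0 -= X0.mean(axis=0)
D0 = X0.T @ X0 / len(X0)

thetas = np.linspace(0, 2 * np.pi, 721)
variances = []
for t in thetas:
    u = np.array([np.cos(t), np.sin(t)])    # the unit vector u(theta)
    variances.append(u @ D0 @ u)            # variance of the score X0 u(theta)
variances = np.array(variances)

t_max = thetas[variances.argmax()]          # direction of maximal variance
t_min = thetas[variances.argmin()]          # direction of minimal variance
print((t_max - t_min) % np.pi)              # close to pi/2: the directions are orthogonal
```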

The two directions where the variance is maximum and minimum are drawn on the original data scatter plot in fig. 8 .

Figure 8: Data with principal directions

Let’s try to understand why this is happening.

1.3.2 Directions of extremal variance

Given our centered $N\times k$ data matrix $X_0$, with its associated covariance matrix $D_0 = \frac{1}{N}X_0^{\top}X_0$, we would like to find unit vectors $u$ in $\mathbb{R}^k$ so that $\sigma_u^2 = u^{\top}D_0u$ reaches its maximum and its minimum. Here $\sigma_u^2$ is the variance of the “linear score” $X_0u$ and it represents how dispersed the data is in the “$u$ direction” in $\mathbb{R}^k$.

In this problem, remember that the coordinates of $u = (u_1,\ldots,u_k)$ are the variables and the symmetric matrix $D_0$ is given. As usual, to find the maximum and minimum values of $\sigma_u^2$ we should look at the partial derivatives of $\sigma_u^2$ with respect to the variables $u_i$ and set them to zero. Here, however, there is a catch – we want to restrict $u$ to being a unit vector, with $u\cdot u = \sum u_i^2 = 1$.

So this is a constrained optimization problem: maximize (or minimize) $\sigma_u^2 = u^{\top}D_0u$ subject to the constraint $u\cdot u = 1$.

We will use the technique of Lagrange Multipliers to solve such a problem.

To apply this method, we introduce the function

$$S(u,\lambda) = u^{\top}D_0u - \lambda(u\cdot u - 1) \qquad (7)$$

Then we compute the gradient

$$\nabla S = \begin{bmatrix}\dfrac{\partial S}{\partial u_1}\\ \vdots\\ \dfrac{\partial S}{\partial u_k}\\ \dfrac{\partial S}{\partial\lambda}\end{bmatrix} \qquad (8)$$

and solve the system of equations $\nabla S = 0$. Here we have written the gradient as a column vector for reasons that will become clearer shortly.

Computing all of these partial derivatives looks messy, but actually if we take advantage of matrix algebra it’s not too bad. The following two lemmas explain how to do this.

Lemma: Let $M$ be an $N\times k$ matrix with constant coefficients and let $u$ be a $k\times 1$ column vector whose entries are $u_1,\ldots,u_k$. The function $F(u) = Mu$ is a linear map from $\mathbb{R}^k$ to $\mathbb{R}^N$. Its (total) derivative is a linear map between the same vector spaces, and satisfies $D(F)(v) = Mv$ for any $k\times 1$ vector $v$. If $u$ is a $1\times N$ matrix and $G(u) = uM$, then $D(G)(v) = vM$

for any $1\times N$ vector $v$. (This is the matrix version of the derivative rule that $\frac{d}{dx}(ax) = a$ for a constant $a$.)

Proof: Since $F:\mathbb{R}^k\to\mathbb{R}^N$, we can write out $F$ in more traditional function notation as $F(u) = (F_1(u_1,\ldots,u_k),\ldots,F_N(u_1,\ldots,u_k))$ where $F_i(u_1,\ldots,u_k) = \sum_{j=1}^{k}m_{ij}u_j$. Thus $\frac{\partial F_i}{\partial u_j} = m_{ij}$. The total derivative $D(F)$ is the linear map with matrix $D(F)_{ij} = \frac{\partial F_i}{\partial u_j} = m_{ij}$, and so $D(F) = M$.

The other result is proved the same way.

Lemma: Let $D$ be a symmetric $k\times k$ matrix with constant entries and let $u$ be a $k\times 1$ column vector of variables $u_1,\ldots,u_k$. Let $F:\mathbb{R}^k\to\mathbb{R}$ be the function $F(u) = u^{\top}Du$. Then the gradient $\nabla_uF$ is a vector field – that is, a vector-valued function of $u$ – and is given by the formula $$\nabla_uF = 2Du$$

Proof: Let $d_{ij}$ be the $i,j$ entry of $D$. We can write out the function $F$ to obtain $$F(u_1,\ldots,u_k) = \sum_{i=1}^{k}\sum_{j=1}^{k}u_id_{ij}u_j.$$ Now $\frac{\partial F}{\partial u_i}$ is going to pick out only the terms where $u_i$ appears, yielding $$\frac{\partial F}{\partial u_i} = \sum_{j=1}^{k}d_{ij}u_j + \sum_{j=1}^{k}u_jd_{ji}.$$ Here the first sum catches all of the terms where the first “$u$” is $u_i$, and the second sum catches all the terms where the second “$u$” is $u_i$. The diagonal terms $u_i^2d_{ii}$ contribute once to each sum, which is consistent with the rule that the derivative of $u_i^2d_{ii}$ is $2u_id_{ii}$. To finish the proof, notice that $\sum_{j=1}^{k}u_jd_{ji} = \sum_{j=1}^{k}d_{ij}u_j$ since $D$ is symmetric, so in fact the two sums are the same. Thus $$\frac{\partial F}{\partial u_i} = 2\sum_{j=1}^{k}d_{ij}u_j.$$ But the right hand side of this equation is twice the $i$th entry of $Du$, so putting the results together we get $$\nabla_uF = \begin{bmatrix}\dfrac{\partial F}{\partial u_1}\\ \vdots\\ \dfrac{\partial F}{\partial u_k}\end{bmatrix} = 2Du.$$

The following theorem puts all of this work together, reducing our question about how the variance changes with direction to a question about the eigenvalues and eigenvectors of the covariance matrix.

1.3.3 Critical values of the variance

Theorem: The critical values of the variance $\sigma_u^2$, as $u$ varies over unit vectors in $\mathbb{R}^k$, are the eigenvalues $\lambda_1,\ldots,\lambda_k$ of the covariance matrix $D_0$, and if $e_i$ is a unit eigenvector corresponding to $\lambda_i$, then $\sigma_{e_i}^2 = \lambda_i$.

Proof: Recall that we introduced the Lagrange function $S(u,\lambda)$, whose critical points give us the solutions to our constrained optimization problem. As we said in eq. 7: $$S(u,\lambda) = u^{\top}D_0u - \lambda(u\cdot u - 1) = u^{\top}D_0u - \lambda(u^{\top}u) + \lambda.$$ Now apply our matrix calculus lemmas. First, let's treat $\lambda$ as a constant and focus on the $u$ variables. We can write $u^{\top}u = u^{\top}I_ku$, where $I_k$ is the identity matrix, to compute $$\nabla_uS = 2D_0u - 2\lambda u.$$ For $\lambda$ we have $\frac{\partial S}{\partial\lambda} = -u\cdot u + 1$. The critical points occur when $$\nabla_uS = 2(D_0 - \lambda I_k)u = 0 \quad\text{and}\quad \frac{\partial S}{\partial\lambda} = 1 - u\cdot u = 0.$$ The first equation says that $\lambda$ must be an eigenvalue, and $u$ an eigenvector: $D_0u = \lambda u$, while the second says $u$ must be a unit vector: $u\cdot u = \|u\|^2 = 1$. The second part of the result follows from the fact that if $e_i$ is a unit eigenvector with eigenvalue $\lambda_i$ then $$\sigma_{e_i}^2 = e_i^{\top}D_0e_i = \lambda_i\|e_i\|^2 = \lambda_i.$$
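The theorem is easy to check numerically. In the sketch below (mine, on randomly generated data) the variance of the score in each eigenvector direction equals the corresponding eigenvalue, and no randomly chosen unit direction produces a variance exceeding the largest eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(5)
X0 = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 3))   # hypothetical data, k = 3 features
X0 -= X0.mean(axis=0)
D0 = X0.T @ X0 / len(X0)

eigvals, eigvecs = np.linalg.eigh(D0)     # real eigenvalues (ascending), orthonormal eigenvectors

# Variance in each eigenvector direction equals the eigenvalue
for lam, e in zip(eigvals, eigvecs.T):
    print(np.isclose(e @ D0 @ e, lam))    # True for each

# Variance in 1000 random unit directions never exceeds the largest eigenvalue
u = rng.normal(size=(1000, 3))
u /= np.linalg.norm(u, axis=1, keepdims=True)
random_vars = np.einsum("ij,jk,ik->i", u, D0, u)
print(random_vars.max() <= eigvals[-1] + 1e-12)            # True
```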

To really make this result pay off, we need to recall some key facts about the eigenvalues and eigenvectors of symmetric matrices. Because these facts are so central to this result, and to other applications throughout machine learning and mathematics generally, we provide proofs in section 1.5.


Table 2: Properties of Eigenvalues of Real Symmetric Matrices
Summary
1. All of the eigenvalues $\lambda_1,\ldots,\lambda_k$ of $D$ are real. If $u^{\top}Du\ge 0$ for all $u\in\mathbb{R}^k$, then all eigenvalues $\lambda_i$ are non-negative. In the latter case we say that $D$ is positive semi-definite.
2. If $v$ is an eigenvector for $D$ with eigenvalue $\lambda$, and $w$ is an eigenvector with a different eigenvalue $\lambda'$, then $v$ and $w$ are orthogonal: $v\cdot w = 0$.
3. There is an orthonormal basis $u_1,\ldots,u_k$ of $\mathbb{R}^k$ made up of eigenvectors of $D$ corresponding to the eigenvalues $\lambda_i$.
4. Let $\Lambda$ be the diagonal matrix with entries $\lambda_1,\ldots,\lambda_k$ and let $P$ be the matrix whose columns are made up of the vectors $u_i$. Then $D = P\Lambda P^{\top}$.

If we combine this theorem with the facts summarized in table 2 then we get a complete picture. Let $D_0$ be the covariance matrix of our data. Since $\sigma_u^2 = u^{\top}D_0u\ge 0$ (it's a sum of squares), we know that the eigenvalues $\lambda_1\ge\lambda_2\ge\cdots\ge\lambda_k\ge 0$ are all nonnegative. Choose a corresponding sequence $u_1,\ldots,u_k$ of orthogonal eigenvectors with all $\|u_i\|^2 = 1$. Since the $u_i$ form a basis of $\mathbb{R}^k$, the weight vector $a$ of any score $S$ is a linear combination of the $u_i$: $a = \sum_{i=1}^{k}a_iu_i$. Since $u_i^{\top}D_0u_j = \lambda_ju_i\cdot u_j = 0$ unless $i=j$, in which case it is $\lambda_i$, we can compute $\sigma_S^2 = \sum_{i=1}^{k}\lambda_ia_i^2$, and $\|a\|^2 = \sum_{i=1}^{k}a_i^2$ since the $u_i$ are an orthonormal set. So in these coordinates, our optimization problem is to maximize (or minimize) $\sum_{i=1}^{k}\lambda_ia_i^2$ subject to the constraint $\sum_{i=1}^{k}a_i^2 = 1$.

We don't need any fancy math to see that the maximum happens when $a_1 = 1$ and the other $a_j = 0$, and in that case the maximum is $\lambda_1$. (If $\lambda_1$ occurs more than once, there may be a whole subspace of directions where the variance is maximal.) Similarly, the minimum value is $\lambda_k$ and occurs when $a_k = 1$ and the others are zero.

1.3.4 Subspaces of extremal variance

We can generalize the idea of the variance of our data in a particular direction to a higher dimensional version: the total variance of the data in a subspace. Suppose that $E$ is a subspace of $\mathbb{R}^k$ and $U$ is a matrix whose columns span $E$ – the columns of $U$ are the weights of a family of scores that span $E$. The values of these scores are $X_0U$ and the covariance matrix of this projected data is $\frac{1}{N}U^{\top}X_0^{\top}X_0U = U^{\top}D_0U$. Finally, the total variance $\sigma_E^2$ of the data projected into $E$ is the sum of the diagonal entries of this matrix:

$$\sigma_E^2 = \operatorname{trace}(U^{\top}D_0U)$$

Just as the variance in a given direction u depends on the scaling of u, the variance in a subspace depends on the scaling of the columns of U. To normalize this scaling, we assume that the columns of U are an orthonormal basis of the subspace E.

Now we can generalize the question asked in section 1.3.2 by seeking, not just a vector $u$ pointing in the direction of extremal variance, but instead a subspace of dimension $s$ with the property that the total variance of the projection of the data into it is maximal compared to its projection into any other subspace of that dimension.

Theorem: Let $u_1,\ldots,u_k$ be the eigenvectors of $D_0$ with eigenvalues $\lambda_1\ge\lambda_2\ge\cdots\ge\lambda_k\ge 0$. Given $1\le s\le k$, choose the $s$ largest eigenvalues of $D_0$ and let $U$ be the matrix whose columns are the corresponding eigenvectors. (There may be several choices here, since some of the eigenvalues may be the same.) The subspaces constructed in this way have the largest total variance among all subspaces of dimension $s$, and that total variance is $\sum_{i=1}^{s}\lambda_i$. These are called subspaces of extremal variance.

Overlooking the issue of repeated eigenvalues, this can be phrased more concisely by saying that the eigenvectors corresponding to the s largest eigenvalues span the subspace of dimension s such that the projection of the data into that subspace has maximal variance among all subspaces of dimension s.

Proof: To make this concrete, suppose we consider a subspace $E$ of $\mathbb{R}^k$ of dimension $t$ with basis $w_1,\ldots,w_t$. Complete this to a basis $w_1,\ldots,w_t,w_{t+1},\ldots,w_k$ of $\mathbb{R}^k$ and then apply the Gram-Schmidt process (see section 1.5.1) to replace it with an orthonormal basis (which we again call $w_1,\ldots,w_t,w_{t+1},\ldots,w_k$) in which $w_1,\ldots,w_t$ are an orthonormal basis for $E$. Let $W$ be the $k\times t$ matrix whose columns are the $w_i$ for $i=1,\ldots,t$. The rows of the matrix $X_0W$ give the coordinates of the projection of each sample into the subspace $E$, expressed in terms of the scores corresponding to these vectors $w_i$. The total variance of these projections is

$$\sigma_E^2 = \frac{1}{N}\sum_{i=1}^{t}\|X_0w_i\|^2 = \sum_{i=1}^{t}(w_i)^{\top}\left(\frac{1}{N}X_0^{\top}X_0\right)w_i = \sum_{i=1}^{t}(w_i)^{\top}D_0w_i$$

If we want to maximize this, we have the constrained optimization problem of finding $w_1,\ldots,w_t$ so that $\sum_{i=1}^{t}(w_i)^{\top}D_0w_i$ is as large as possible, subject to the constraints $w_i\cdot w_i = 1$ and $w_i\cdot w_j = 0$ for $i\neq j$.

Then the span $E$ of these $w_i$ is a subspace of extremal variance.

Theorem: A $t$-dimensional subspace $E$ is a subspace of extremal variance if and only if it is spanned by $t$ orthonormal eigenvectors of the matrix $D_0$ corresponding to the $t$ largest eigenvalues of $D_0$.

Proof: We can approach this problem using Lagrange multipliers and matrix calculus if we are careful. Our unknown is a $k\times t$ matrix $W$ whose columns are the $t$ (unknown) vectors $w_i$. The objective function that we are seeking to maximize is $$F = \operatorname{trace}(W^{\top}D_0W) = \sum_{i=1}^{t}(w_i)^{\top}D_0w_i.$$ The constraints are the requirements that $\|w_i\|^2 = 1$ and $w_i\cdot w_j = 0$ if $i\neq j$. If we introduce a matrix of Lagrange multipliers $\Lambda = (\lambda_{ij})$, where $\lambda_{ij}$ is the multiplier that goes with the first of these constraints when $i=j$, and with the second when $i\neq j$, we can express our Lagrange function as $$S(W,\Lambda) = \operatorname{trace}(W^{\top}D_0W) - \operatorname{trace}\left((W^{\top}W - I)\Lambda\right)$$ where $I$ is the $t\times t$ identity matrix.

Taking the derivatives with respect to the entries of $W$ and of $\Lambda$ yields the following two equations: $$D_0W = W\Lambda, \qquad W^{\top}W = I.$$

The first of these equations says that the space E spanned by the columns of W is invariant under D0, while the second says that the columns of W form an orthonormal basis.

Let's assume for the moment that we have a matrix $W$ that satisfies these conditions. Then it must be the case that $\Lambda$ is a symmetric, real valued $t\times t$ matrix, since $$W^{\top}D_0W = W^{\top}W\Lambda = \Lambda,$$ and the matrix on the left is symmetric.

By the properties of real symmetric matrices (the spectral theorem), there are orthonormal vectors $q_1,\ldots,q_t$ that are eigenvectors of $\Lambda$ with corresponding eigenvalues $\tau_i$. If we let $Q$ be the matrix whose columns are the vectors $q_i$ and let $T$ be the diagonal $t\times t$ matrix whose entries are the $\tau_i$, we have $\Lambda Q = QT$.

If we go back to our original equations, we see that if $W$ exists such that $D_0W = W\Lambda$, then there is a matrix $Q$ with orthonormal columns and a diagonal matrix $T$ such that $$D_0WQ = W\Lambda Q = WQT.$$ In other words, $WQ$ is a matrix whose columns are eigenvectors of $D_0$ with eigenvalues $\tau_i$ for $i=1,\ldots,t$.

Thus we see how to construct an invariant subspace $E$ and a solution matrix $W$: such an $E$ is spanned by $t$ orthonormal eigenvectors of $D_0$ with eigenvalues $\tau_i$, and $W$ is the matrix whose columns are these eigenvectors. Further, in that case, the total variance associated to $E$ is the sum of the eigenvalues $\tau_i$; to make this as large as possible, we should choose our eigenvectors to correspond to the $t$ largest eigenvalues of $D_0$. This concludes the proof.

1.3.5 Definition of Principal Components

Definition: The orthonormal unit eigenvectors $u_i$ of $D_0$ are the principal directions or principal components for the data $X_0$.

Theorem: The maximum variance occurs in the principal direction(s) associated to the largest eigenvalue, and the minimum variance in the principal direction(s) associated with the smallest one. The covariance between scores in principal directions associated with different eigenvalues is zero.

At this point, the picture in fig. 8 makes sense – the red and green dashed lines are the principal directions; they are orthogonal to one another, and they point in the directions where the data is most (and least) “spread out.”

Proof: The statement about the largest and smallest eigenvalues is proved at the very end of the last section. The covariance of two scores corresponding to different eigenvectors $u_i$ and $u_j$ is $$u_i^{\top}D_0u_j = \lambda_j(u_i\cdot u_j) = 0$$ since $u_i$ and $u_j$ are orthogonal.

Sometimes the results above are presented in a slightly different form, and may be referred to, in part, as Rayleigh’s theorem.

Corollary: (Rayleigh's Theorem) Let $D$ be a real symmetric matrix and let $$H(v) = \frac{v^{\top}Dv}{v\cdot v}, \qquad v\neq 0.$$ Then $\max_{v\neq 0}H(v)$ is the largest eigenvalue of $D$. (Similarly, if we replace max by min, the minimum is the least eigenvalue.)

Proof: The maximum of the function H(v) is the solution to the same optimization problem that we considered above.
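A quick numerical illustration of Rayleigh's theorem (my own sketch): for a random symmetric matrix, the Rayleigh quotient of every nonzero vector lies between the smallest and largest eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.normal(size=(4, 4))
D = (A + A.T) / 2                       # a random real symmetric matrix

def rayleigh(v):
    return v @ D @ v / (v @ v)

eigvals = np.linalg.eigvalsh(D)         # eigenvalues in ascending order

v = rng.normal(size=(2000, 4))          # many random nonzero vectors
q = np.array([rayleigh(vi) for vi in v])
print(eigvals[0] - 1e-9 <= q.min())     # True: never below the least eigenvalue
print(q.max() <= eigvals[-1] + 1e-9)    # True: never above the largest eigenvalue
```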

Exercises.

  1. Prove that the two expressions for $\sigma_X^2$ given in section 1.2.1 are the same.

  2. Prove that the covariance matrix is as described in the proposition in 1.2.4.

  3. Let $X_0$ be an $N\times k$ matrix with entries $x_{ij}$ for $1\le i\le N$ and $1\le j\le k$. If a linear score is defined by the constants $a_1,\ldots,a_k$, check that eq. 5 holds as claimed.

  4. Why is it important to use a unit vector when computing the variance of $X_0$ in the direction of $u$? Suppose $v = \lambda u$, where $u$ is a unit vector and $\lambda > 0$ is a constant, and let $S'$ be the score $X_0v$. How is the variance of $S'$ related to that of $S = X_0u$?

1.4 Dimensionality Reduction via Principal Components

The principal components associated with a dataset separate out directions in the feature space in which the data is most (or least) variable. One of the main applications of this information is to enable us to take data with a great many features – a set of points in a high dimensional space – and, by focusing our attention on the scores corresponding to the principal directions, capture most of the information in the data in a much lower dimensional setting.

To illustrate how this is done, let $X$ be an $N\times k$ data matrix, let $X_0$ be its centered version, and let $D_0 = \frac{1}{N}X_0^{\top}X_0$ be the associated covariance matrix.

Apply the spectral theorem (described in table 2 and proved in section 1.5) to the covariance matrix to obtain eigenvalues $\lambda_1\ge\lambda_2\ge\cdots\ge\lambda_k\ge 0$ and associated eigenvectors $u_1,\ldots,u_k$. The scores $S_i = X_0u_i$ give the values of the data in the principal directions. The variance of $S_i$ is $\lambda_i$.

Now choose a number $t<k$ and consider the vectors $S_1,\ldots,S_t$. The $j$th entry in $S_i$ is the value of the score $S_i$ for the $j$th data point. Because $S_1,\ldots,S_t$ capture the most significant variability in the original data, we can learn a lot about our data by considering just these $t$ features of the data, instead of needing all $k$.
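The whole recipe fits in a few lines of NumPy. The sketch below is my own and uses its own synthetic low-rank data (not the dataset of table 3); it centers the data, eigendecomposes the covariance matrix, reports how much variance the first $t$ components explain, and computes the $N\times t$ matrix of scores.

```python
import numpy as np

rng = np.random.default_rng(7)
# Synthetic data whose variation mostly lives in a few directions
N, k, t = 200, 15, 2
hidden = rng.normal(size=(N, 3)) @ rng.normal(size=(3, k))   # low-rank structure
X = hidden + 0.2 * rng.normal(size=(N, k))                   # plus a little noise

X0 = X - X.mean(axis=0)                    # center the data
D0 = X0.T @ X0 / N                         # covariance matrix

eigvals, eigvecs = np.linalg.eigh(D0)      # ascending order
order = np.argsort(eigvals)[::-1]          # re-sort in descending order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Fraction of the total variance captured by the first t principal components
explained = eigvals[:t].sum() / eigvals.sum()
print(f"first {t} components explain {explained:.1%} of the total variance")

# Scores in the first t principal directions: an N x t matrix, one row per sample
scores = X0 @ eigvecs[:, :t]
print(scores.shape)                        # (200, 2)
```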

To illustrate, let’s look at an example. We begin with a synthetic dataset X0 which has 200 samples and 15 features. The data (some of it) for some of the samples is shown in table 3.

Table 3: Simulated Data for PCA Analysis
f-0 f-1 f-2 f-3 f-4 ... f-10 f-11 f-12 f-13 f-14
s-0 1.18 -0.41 2.02 0.44 2.24 ... 0.32 0.95 0.88 1.10 0.89
s-1 0.74 0.58 1.54 0.23 2.05 ... 0.99 1.14 1.56 0.99 0.59
... ... ... ... ... ... ... ... ... ... ... ...
s-198 1.04 2.02 1.44 0.40 1.33 ... 0.62 0.62 0.54 1.96 0.04
s-199 0.92 2.09 1.58 1.19 1.17 ... 0.42 0.85 0.83 2.22 0.90

The full dataset is a 200×15 matrix; it has 3000 numbers in it and we’re not really equipped to make sense of it. We could try some graphing – for example, fig. 9 shows a scatter plot of two of the features plotted against each other.

Figure 9: Scatter Plot of Two Features

Unfortunately there's not much to see in fig. 9 – just a blob – because the individual features of the data don't tell us much in isolation; whatever structure there is in this data arises out of the relationships between different features.

In fig. 10 we show a “density grid” plot of the data. The graph in position i,j shows a scatter plot of the ith and jth columns of the data, except in the diagonal positions, where in position i,i we plot a histogram of column i. There’s not much structure visible; it is a lot of blobs.

Figure 10: Density Grid Plot of All Features

So let's apply the theory of principal components. We use a software package to compute the eigenvalues and eigenvectors of the matrix $D_0$. The 15 eigenvalues $\lambda_1\ge\cdots\ge\lambda_{15}$ are plotted, in descending order, in fig. 11.

Figure 11: Eigenvalues of the Covariance Matrix

This plot shows that the first 4 eigenvalues are relatively large, while the remaining 11 are smaller and not much different from each other. We interpret this as saying that most of the variation in the data is accounted for by the first four principal components. We can even make this quantitative. The total variance of the data is the sum of the eigenvalues of the covariance matrix – the trace of $D_0$ – and in this example that sum is around 5. The sum of the first 4 eigenvalues is about 4, so the first four eigenvalues account for about 4/5 of the total variance, or about 80% of the variation of the data.

Now let's focus in on the two largest eigenvalues $\lambda_1$ and $\lambda_2$ and their corresponding eigenvectors $u_1$ and $u_2$. The $200\times 1$ column vectors $S_1 = X_0u_1$ and $S_2 = X_0u_2$ are the values of the scores associated with these two eigenvectors. So for each data point (each row of $X_0$) we have two values (the corresponding entries of $S_1$ and $S_2$). In fig. 12 we show a scatter plot of these scores.

Figure 12: Scatter Plot of Scores in the First Two Principal Directions

Notice that suddenly some structure emerges in our data! We can see that the 200 points are separated into five clusters, distinguished by the values of their scores. This ability to find hidden structure in complicated data is one of the most important applications of principal components.

If we were dealing with real data, we would now want to investigate the different groups of points to see if we can understand what characteristics the principal components have identified.

1.4.1 Loadings

There's one last piece of the PCA puzzle that we are going to investigate. In fig. 12, we plotted our data points in the coordinates given by the first two principal components. In geometric terms, we took the cloud of 200 points in $\mathbb{R}^{15}$ given by the rows of $X_0$ and projected those points into the two dimensional plane spanned by the eigenvectors $u_1$ and $u_2$, and then plotted the distribution of the points in that plane.

More generally, suppose we take our dataset $X_0$ and consider the first $t$ principal components, corresponding to the eigenvectors $u_1,\ldots,u_t$. The projection of the data into the space spanned by these eigenvectors is represented by the $N\times t$ matrix $S = X_0U$, where $U$ is the $k\times t$ matrix whose columns are the eigenvectors $u_i$. The $i$th column of $S$ gives the values of the score arising from $u_i$, for $i=1,\ldots,t$, and each row of $S$ gives the coordinates of one sample in this $t$-dimensional subspace.

The remaining question that we wish to consider is: how can we see some evidence of the original features in this subspace? We can answer this by imagining an artificial sample $x$ that has a measurement of 1 for the $i$th feature and a measurement of zero for all the other features. The corresponding point is represented by a $1\times k$ row vector with a 1 in position $i$. The projection of this synthetic sample into the span of the first $t$ principal components is the $1\times t$ vector $xU$. Notice, however, that $xU$ is just the $i$th row of the matrix $U$. This vector in the space spanned by the $u_i$ is called the “loading” of the $i$th feature in the principal components.

This is illustrated in fig. 13, which shows a line along the direction of the loading corresponding to each feature, added to the scatter plot of the data in the plane spanned by the first two principal components. One observation we can make is that some of the features are more “left to right,” like features 7 and 8, while others are more “top to bottom,” like feature 6. So points that lie on the left side of the plot have smaller values of features 7 and 8, while those at the top of the plot have larger values of feature 6.

Figure 13: Loadings in the Principal Component Plane
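Computing the loadings takes only one extra step beyond the score computation. The sketch below (my own, on random stand-in data rather than the dataset of the figures) extracts the first two principal directions and prints the loading of each feature, i.e. the corresponding row of the $k\times t$ matrix $U$.

```python
import numpy as np

rng = np.random.default_rng(8)
X0 = rng.normal(size=(200, 15))            # stand-in data: N = 200, k = 15
X0 -= X0.mean(axis=0)
D0 = X0.T @ X0 / len(X0)

eigvals, eigvecs = np.linalg.eigh(D0)
order = np.argsort(eigvals)[::-1]
U = eigvecs[:, order][:, :2]               # k x t matrix of the first t = 2 principal directions

# The loading of feature i is the image of the i-th standard basis vector
# under projection into the principal component plane -- the i-th row of U.
loadings = np.eye(X0.shape[1]) @ U
print(np.allclose(loadings, U))            # True: the loadings are just the rows of U
print(loadings[6], loadings[7])            # loadings of two features, for example
```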

In the next, and last, section, of this discussion of Principal Component Analysis, we will give proofs of the key mathematical ideas summarized earlier in table 2, which have been central to this analysis.

1.4.2 The singular value decomposition

The singular value decomposition is a slightly different way of looking at principal components. In this subsection it is convenient to drop the factor of $1/N$ and work with the matrix $D_0 = X_0^{\top}X_0$; this rescales the eigenvalues by $N$ but does not change the eigenvectors. Let $\Lambda$ be the diagonal matrix of eigenvalues of $D_0$; we know that these eigenvalues are non-negative. Let's drop the eigenvectors corresponding to the zero eigenvalue, and say that there are $s$ non-zero eigenvalues and $s$ corresponding eigenvectors.

Lemma: Let $P$ be the $k\times s$ matrix whose columns are the (unit) eigenvectors with non-zero eigenvalues, and let $\Lambda_{+}$ be the $s\times s$ diagonal matrix whose entries are the non-zero eigenvalues. Then $$P\Lambda_{+}P^{\top} = P\Lambda P^{\top} = D_0.$$

Proof: First observe that $P\Lambda_{+}P^{\top}$ is in fact a $k\times k$ matrix. Then look at the block structure to verify the result.

The matrix $\Lambda_{+}$ is diagonal and invertible, and, since its entries are positive, it makes sense to consider the real matrix $\Lambda_{+}^{1/2}$ whose entries are the square roots of the eigenvalues.

Let $U = X_0P\Lambda_{+}^{-1/2}$. Note that $U$ is an $N\times s$ matrix.

Lemma: The columns of U are orthonormal.

Proof: Compute the $s\times s$ matrix $U^{\top}U$, whose entries are all of the dot products of the columns of $U$: $$U^{\top}U = \Lambda_{+}^{-1/2}P^{\top}X_0^{\top}X_0P\Lambda_{+}^{-1/2} = \Lambda_{+}^{-1/2}P^{\top}P\Lambda_{+}P^{\top}P\Lambda_{+}^{-1/2} = I_s$$ by the previous lemma and the fact that $P^{\top}P$ is the $s\times s$ identity matrix.

Rearranging this yields the singular value decomposition.

Theorem: (The singular value decomposition) The matrix $X_0$ has a decomposition $$X_0 = U\Lambda_{+}^{1/2}P^{\top}$$ where $U$ (of dimension $N\times s$) and $P$ (of dimension $k\times s$) have orthonormal columns, and $\Lambda_{+}$ (of dimension $s\times s$) is diagonal with positive entries. Furthermore, the entries of $\Lambda_{+}$ are the non-zero eigenvalues of $D_0 = X_0^{\top}X_0$, and $U$ and $P$ are determined by $X_0$ up to the choice of the eigenvectors (for example, the sign of each column).

Proof: We won’t work through all of this, as it is a reinterpretation of our work on principal components.

Remark: The entries of $\Lambda_{+}^{1/2}$ are called the singular values of $X_0$. They can also be found directly by considering an optimization problem equivalent to the one we solved in section 1.3.4.
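In practice one usually computes the decomposition with a library routine rather than by diagonalizing $X_0^{\top}X_0$. The sketch below (my own, on random data) uses NumPy's `np.linalg.svd` and checks that the squared singular values are the non-zero eigenvalues of $X_0^{\top}X_0$ and that the right singular vectors agree with the principal directions up to sign.

```python
import numpy as np

rng = np.random.default_rng(9)
X0 = rng.normal(size=(50, 4))
X0 -= X0.mean(axis=0)                                # centered data matrix

# Thin singular value decomposition: X0 = U diag(s) Vt
U, s, Vt = np.linalg.svd(X0, full_matrices=False)
print(np.allclose(U @ np.diag(s) @ Vt, X0))          # True

# Squared singular values = eigenvalues of X0^T X0 (descending order)
eigvals = np.linalg.eigvalsh(X0.T @ X0)[::-1]
print(np.allclose(s ** 2, eigvals))                  # True

# Rows of Vt are the principal directions, up to sign
_, eigvecs = np.linalg.eigh(X0.T @ X0)
print(np.allclose(np.abs(Vt[0]), np.abs(eigvecs[:, -1])))   # True
```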

1.5 Eigenvalues and Eigenvectors of Real Symmetric Matrices (The Spectral Theorem)

Now that we’ve shown how to apply the theory of eigenvalues and eigenvectors of symmetric matrices to extract principal directions from data, and to use those principal directions to find structure, we will give a proof of the properties that we summarized in table 2.

A key tool in the proof is the Gram-Schmidt orthogonalization process.

1.5.1 Gram-Schmidt

Proposition (Gram-Schmidt Process): Let $w_1,\ldots,w_k$ be a collection of linearly independent vectors in $\mathbb{R}^N$ and let $W$ be the span of the $w_i$. Let $u_1 = w_1$ and let $$u_i = w_i - \sum_{j=1}^{i-1}\frac{w_i\cdot u_j}{u_j\cdot u_j}u_j$$ for $i=2,\ldots,k$. Then the vectors $u_1,\ldots,u_k$ are nonzero and pairwise orthogonal, and for each $i$ the span of $u_1,\ldots,u_i$ equals the span of $w_1,\ldots,w_i$; in particular, dividing each $u_i$ by its length yields an orthonormal basis of $W$.

Proof: This is an inductive exercise, and we leave it to you to work out the details.
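The process translates directly into code. Here is a minimal NumPy implementation (mine, following the formula in the proposition) together with a numerical check of the two claimed properties.

```python
import numpy as np

def gram_schmidt(W):
    """Orthogonalize the columns of W (assumed linearly independent):
    u_1 = w_1 and u_i = w_i - sum_{j<i} ((w_i . u_j)/(u_j . u_j)) u_j."""
    U = []
    for w in W.T:
        u = w.copy()
        for prev in U:
            u = u - (w @ prev) / (prev @ prev) * prev
        U.append(u)
    return np.column_stack(U)

rng = np.random.default_rng(10)
W = rng.normal(size=(5, 3))     # 3 generically independent vectors in R^5 (as columns)
U = gram_schmidt(W)

G = U.T @ U                                     # Gram matrix of the new vectors
print(np.allclose(G, np.diag(np.diag(G))))      # True: the u_i are pairwise orthogonal
print(np.linalg.matrix_rank(np.hstack([W, U])) == 3)   # True: same span as the w_i
```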

1.5.2 The spectral theorem

Theorem: Let $D$ be a real symmetric $N\times N$ matrix. Then:

  1. All of the $N$ eigenvalues $\lambda_1\ge\lambda_2\ge\cdots\ge\lambda_N$ are real. If $u^{\top}Du\ge 0$ for all $u\in\mathbb{R}^N$, then all eigenvalues $\lambda_i\ge 0$.
  2. The matrix $D$ is diagonalizable – that is, it has $N$ linearly independent eigenvectors.
  3. If $v$ and $w$ are eigenvectors corresponding to eigenvalues $\lambda$ and $\lambda'$, with $\lambda\neq\lambda'$, then $v$ and $w$ are orthogonal: $v\cdot w = 0$.
  4. There is an orthonormal basis $u_1,\ldots,u_N$ of $\mathbb{R}^N$ made up of eigenvectors for the eigenvalues $\lambda_i$.
  5. Let $\Lambda$ be the diagonal matrix with entries $\lambda_1,\ldots,\lambda_N$ and let $P$ be the matrix whose columns are made up of the eigenvectors $u_i$. Then $D = P\Lambda P^{\top}$.

Proof: Start with 1. Suppose that $\lambda$ is an eigenvalue of $D$. Let $u$ be a corresponding nonzero eigenvector. Then $Du = \lambda u$ and $D\bar{u} = \bar{\lambda}\bar{u}$, where $\bar{u}$ is the vector whose entries are the complex conjugates of the entries of $u$ (and $\bar{D} = D$ since $D$ is real). Now we have $$\bar{u}^{\top}Du = \lambda\,\bar{u}^{\top}u = \lambda\|u\|^2 \quad\text{and}\quad u^{\top}D\bar{u} = \bar{\lambda}\,u^{\top}\bar{u} = \bar{\lambda}\|u\|^2.$$ But the left hand sides of both of these equations are the same (take the transpose and use the symmetry of $D$), so we must have $\lambda\|u\|^2 = \bar{\lambda}\|u\|^2$, so $\lambda = \bar{\lambda}$, meaning $\lambda$ is real.

If we have the additional property that $u^{\top}Du\ge 0$ for all $u$, then in particular, for an eigenvector $u$ with eigenvalue $\lambda$, we have $u^{\top}Du = \lambda\|u\|^2\ge 0$, and since $\|u\|^2 > 0$ we must have $\lambda\ge 0$.

Property 2 is in some ways the most critical fact. We know from the general theory of the characteristic polynomial, and the fundamental theorem of algebra, that D has N complex eigenvalues, although some may be repeated. However, it may not be the case that D has N linearly independent eigenvectors – it may not be diagonalizable. So we will establish that now.

A one-by-one matrix is automatically symmetric and diagonalizable. In the $N$-dimensional case, we know, at least, that $D$ has at least one eigenvector, and a real one at that by part 1, and this gives us a place to begin an inductive argument.

Let $v_N\neq 0$ be an eigenvector with eigenvalue $\lambda$, normalized so that $\|v_N\|^2 = 1$, and extend this to a basis $v_1,\ldots,v_N$ of $\mathbb{R}^N$. Apply the Gram-Schmidt process to construct an orthonormal basis $u_1,\ldots,u_N$ of $\mathbb{R}^N$ with $u_N = v_N$.

Any vector $v\in\mathbb{R}^N$ is a linear combination $v = \sum_{i=1}^{N}a_iu_i$ and, since the $u_i$ are orthonormal, the coefficients can be calculated as $a_i = u_i\cdot v$.

Using this, we can find the matrix $D'$ of the linear map defined by our original matrix $D$ in this new basis. By definition, if $d'_{ij}$ are the entries of $D'$, then

$$Du_i = \sum_{j=1}^{N}d'_{ij}u_j$$

and so

$$d'_{ij} = u_j\cdot Du_i = u_j^{\top}Du_i.$$

Since $D$ is symmetric, $u_j^{\top}Du_i = u_i^{\top}Du_j$ and so $d'_{ij} = d'_{ji}$. In other words, the matrix $D'$ is still symmetric. Furthermore,

$$d'_{Ni} = u_i^{\top}Du_N = u_i^{\top}\lambda u_N = \lambda(u_i\cdot u_N)$$

since $u_N = v_N$ is an eigenvector with eigenvalue $\lambda$. Since the $u_i$ are an orthonormal basis, we see that $d'_{iN} = d'_{Ni} = 0$ unless $i = N$, and $d'_{NN} = \lambda$.

In other words, the matrix $D'$ has a block form: $$D' = \begin{pmatrix} D'' & 0\\ 0 & \lambda\end{pmatrix}$$ where the $(N-1)\times(N-1)$ block $D''$ is symmetric. The inductive hypothesis tells us that the symmetric matrix $D''$ is diagonalizable, so it has a basis of eigenvectors $u'_1,\ldots,u'_{N-1}$ with eigenvalues $\lambda_1,\ldots,\lambda_{N-1}$; this gives us a basis for the subspace of $\mathbb{R}^N$ spanned by $u_1,\ldots,u_{N-1}$ which, together with $u_N$, gives us a basis of $\mathbb{R}^N$ consisting of eigenvectors of $D$.

This finishes the proof of Property 2.

For property 3, suppose $Dv = \lambda v$ and $Dw = \lambda'w$. Then $$v^{\top}Dw = \lambda'(v\cdot w) \quad\text{and}\quad w^{\top}Dv = \lambda(w\cdot v),$$ and the two left hand sides are equal because $D$ is symmetric. Since $\lambda\neq\lambda'$, we must have $v\cdot w = 0$.

For property 4, if the eigenvalues are all distinct, this is a consequence of properties 2 and 3 – you have $N$ eigenvectors, scaled to length 1, for different eigenvalues, and by property 3 they are orthogonal. So the only complication is the case where some eigenvalues are repeated. If $\lambda$ occurs $r$ times, then you have $r$ linearly independent vectors $u_1,\ldots,u_r$ that span the $\lambda$-eigenspace. The Gram-Schmidt process allows you to construct an orthonormal set that spans this eigenspace, and while this orthonormal set isn't unique, any one of them will do.

For property 5, let $e_i$ be the column vector that is zero except for a 1 in position $i$, so that $e_j^{\top}De_i = d_{ij}$. Let's write $e_i$ and $e_j$ in terms of the orthonormal basis $u_1,\ldots,u_N$: $$e_i = \sum_{k=1}^{N}(e_i\cdot u_k)u_k \quad\text{and}\quad e_j = \sum_{k=1}^{N}(e_j\cdot u_k)u_k.$$ Using this expansion, we compute $e_j^{\top}De_i$ in a more complicated way: $$e_j^{\top}De_i = \sum_{r=1}^{N}\sum_{s=1}^{N}(e_j\cdot u_r)(e_i\cdot u_s)(u_r^{\top}Du_s).$$ But $u_r^{\top}Du_s = \lambda_s(u_r\cdot u_s) = 0$ unless $r = s$, in which case it equals $\lambda_r$, so $$e_j^{\top}De_i = \sum_{r=1}^{N}\lambda_r(e_j\cdot u_r)(e_i\cdot u_r).$$ On the other hand, $$P^{\top}e_i = \begin{bmatrix}(e_i\cdot u_1)\\ (e_i\cdot u_2)\\ \vdots\\ (e_i\cdot u_N)\end{bmatrix} \quad\text{and}\quad \Lambda P^{\top}e_i = \begin{bmatrix}\lambda_1(e_i\cdot u_1)\\ \lambda_2(e_i\cdot u_2)\\ \vdots\\ \lambda_N(e_i\cdot u_N)\end{bmatrix}.$$ Therefore the $i,j$ entry of $P\Lambda P^{\top}$ is $$(e_j^{\top}P)\Lambda(P^{\top}e_i) = \sum_{r=1}^{N}\lambda_r(e_i\cdot u_r)(e_j\cdot u_r) = d_{ij},$$ so the two matrices $D$ and $P\Lambda P^{\top}$ are in fact equal.
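For readers who want a concrete check, the following short sketch (mine) verifies the conclusion of the theorem numerically for a random symmetric matrix: the eigenvalues are real, the eigenvectors returned are orthonormal, and $D = P\Lambda P^{\top}$.

```python
import numpy as np

rng = np.random.default_rng(11)
A = rng.normal(size=(5, 5))
D = (A + A.T) / 2                        # a random real symmetric matrix

lam, P = np.linalg.eigh(D)               # real eigenvalues and orthonormal eigenvectors

print(np.allclose(P.T @ P, np.eye(5)))           # True: columns of P are orthonormal
print(np.allclose(P @ np.diag(lam) @ P.T, D))    # True: D = P Lambda P^T
```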

Exercises:

  1. Prove the rest of the first lemma in section 1.4.2.

  2. Prove the Gram-Schmidt Process has the claimed properties in section 1.5.1.