Suppose that, as usual, we begin with a collection of measurements of different features for a group of samples. Some of these measurements will tell us quite a bit about the differences among our samples, while others may contain relatively little information. For example, if we are analyzing the effect of a certain weight loss regimen on a group of people, the age and weight of the subjects may have a great deal of influence on how successful the regimen is, while their blood pressure might not. One way to help identify which features are more significant is to ask whether or not the feature varies a lot among the different samples. If nearly all the measurements of a feature are the same, it can’t have much power in distinguishing the samples, while if the measurements vary a great deal then that feature has a chance to contain useful information.
In this section we will discuss a way to measure the variability of measurements and then introduce principal component analysis (PCA). PCA is a method for finding which linear combinations of measurements have the greatest variability and therefore might contain the most information. It also allows us to identify combinations of measurements that don’t vary much at all. Combining this information, we can sometimes replace our original system of features with a smaller set that still captures most of the interesting information in our data, and thereby find hidden characteristics of the data and simplify our analysis a great deal.
1.2 Variance and Covariance
1.2.1 Variance
Suppose that we have a collection of measurements $x_1, x_2,\ldots,x_N$ of a particular feature $X$. For example, $x_i$ might be the initial weight of the $i^{th}$ participant in our weight loss study. The mean of the values is
$$\bar{x}=\frac{1}{N}\sum_{i=1}^{N}x_i.$$
The simplest measure of the variability of the data is called its variance.
Definition: The (sample) variance of the data $x_1,\ldots,x_N$ is
$$\sigma_x^{2}=\frac{1}{N-1}\sum_{i=1}^{N}(x_i-\bar{x})^{2}. \tag{1}$$
The square root of the variance is called the standard deviation.
As we see from the formula, the variance is a measure of how ‘spread out’ the data is from the mean.
Recall that in our discussion of linear regression we thought of our set of measurements as a vector $x\in\mathbb{R}^{N}$ – it’s one of the columns of our data matrix. From that point of view, the variance has a geometric interpretation – it is $\frac{1}{N-1}$ times the square of the distance from the point $x$ to the point $\bar{x}\mathbf{1}=(\bar{x},\bar{x},\ldots,\bar{x})$:
$$\sigma_x^{2}=\frac{1}{N-1}\|x-\bar{x}\mathbf{1}\|^{2}. \tag{2}$$
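As a quick sanity check (not part of the text's development), here is a minimal NumPy sketch that computes the sample variance both from the sum in eq. 1 and from the squared distance in eq. 2; the measurements are made-up numbers.

```python
import numpy as np

# Made-up measurements of a single feature (e.g., initial weights in kg).
x = np.array([82.0, 75.5, 91.2, 68.4, 77.9])
N = len(x)
x_bar = x.mean()

# Eq. (1): sample variance as an average of squared deviations.
var_from_sum = ((x - x_bar) ** 2).sum() / (N - 1)

# Eq. (2): the same quantity as 1/(N-1) times the squared distance
# from x to the constant vector (x_bar, ..., x_bar).
var_from_norm = np.linalg.norm(x - x_bar * np.ones(N)) ** 2 / (N - 1)

print(var_from_sum, var_from_norm, np.var(x, ddof=1))  # all three agree
```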
1.2.2 Covariance
The variance measures the dispersion of measures of a single feature. Often, we have measurements of multiple features and we might want to know something about how two features are related. The covariance is a measure of whether two features tend to be related, in the sense that when one increases, the other one increases; or when one increases, the other one decreases.
Definition: Given measurements $x=(x_1,\ldots,x_N)$ and $y=(y_1,\ldots,y_N)$ of two features $X$ and $Y$, the covariance of $X$ and $Y$ is
$$\sigma_{xy}=\frac{1}{N-1}\sum_{i=1}^{N}(x_i-\bar{x})(y_i-\bar{y}). \tag{3}$$
There is a nice geometric interpretation of this, as well, in terms of the dot product. If $x_0=x-\bar{x}\mathbf{1}$ and $y_0=y-\bar{y}\mathbf{1}$ are the centered vectors, then
$$\sigma_{xy}=\frac{1}{N-1}\,x_0\cdot y_0.$$
From this point of view, we can see that $\sigma_{xy}$ is positive if the vectors $x_0$ and $y_0$ "point roughly in the same direction" and it is negative if they "point roughly in the opposite direction."
1.2.3 Correlation
One problem with interpreting the variance and covariance is that we don’t have a scale – for example, if $\sigma_{xy}$ is large and positive, then we’d like to say that $X$ and $Y$ are closely related, but it could be just that the entries of $x$ and $y$ are large. Here, though, we can really take advantage of the geometric interpretation. Recall that the dot product of two vectors $a$ and $b$ satisfies the formula
$$a\cdot b=\|a\|\,\|b\|\cos(\theta),$$
where $\theta$ is the angle between $a$ and $b$. So
$$\cos(\theta)=\frac{a\cdot b}{\|a\|\,\|b\|}.$$
Let’s apply this to the variance and covariance, by noticing that
$$\|x_0\|=\sqrt{N-1}\,\sigma_x \quad\text{and}\quad \|y_0\|=\sqrt{N-1}\,\sigma_y,$$
so the quantity
$$r_{xy}=\frac{\sigma_{xy}}{\sigma_x\sigma_y}=\frac{x_0\cdot y_0}{\|x_0\|\,\|y_0\|} \tag{4}$$
measures the cosine of the angle between the vectors $x_0$ and $y_0$.
Definition: The quantity $r_{xy}$ defined in eq. 4 is called the (sample) correlation coefficient between $X$ and $Y$. We have $-1\le r_{xy}\le 1$, with $r_{xy}=\pm 1$ if and only if the two vectors $x_0$ and $y_0$ are collinear in $\mathbb{R}^{N}$.
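The following short sketch (again with made-up numbers, and assuming NumPy) checks that the correlation coefficient of eq. 4 is both the cosine of the angle between the centered vectors and the ratio $\sigma_{xy}/(\sigma_x\sigma_y)$.

```python
import numpy as np

# Made-up paired measurements of two features X and Y.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.2, 4.8, 5.1])
N = len(x)

x0 = x - x.mean()          # centered versions of x and y
y0 = y - y.mean()

cov_xy = (x0 @ y0) / (N - 1)                                   # eq. (3)
r_xy = (x0 @ y0) / (np.linalg.norm(x0) * np.linalg.norm(y0))   # eq. (4)

# r_xy is the cosine of the angle between x0 and y0, and it matches
# the usual formula cov/(sigma_x * sigma_y).
print(cov_xy, r_xy, cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1)))
```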
Figure 1 illustrates data with different values of the correlation coefficient.
Figure 1: Correlation
1.2.4 The covariance matrix
In a typical situation we have measurements of many features for each of our (many) samples, which we organize into a data matrix $X$. To recall, each column of $X$ corresponds to a feature that we measure, and each row corresponds to a sample. For example, each row of our matrix might correspond to a person enrolled in a study, and the columns correspond to height (cm), weight (kg), systolic blood pressure, and age (in years):
Table 1: A sample data matrix
sample   Ht    Wgt   Bp    Age
A        180   75    110   35
B        193   80    130   40
…        …     …     …     …
U        150   92    105   55
If we have multiple features, as in this example, we might be interested in the variance of each feature and all of their mutual covariances. This “package” of information can be obtained “all at once” by taking advantage of some matrix algebra.
Definition: Let $X$ be an $N\times k$ data matrix, where the $k$ columns of $X$ correspond to different features and the $N$ rows to different samples. Let $X_0$ be the centered version of this data matrix, obtained by subtracting the mean $\bar{x}_j$ of column $j$ from all the entries in that column. Then the symmetric $k\times k$ matrix
$$D=\frac{1}{N-1}X_0^{T}X_0$$
is called the (sample) covariance matrix for the data.
Proposition: The diagonal entries of $D$ are the variances of the columns of $X$:
$$D_{jj}=\sigma_j^{2},$$
and the off-diagonal entries $D_{ij}$ are the covariances of the $i^{th}$ and $j^{th}$ columns of $X$:
$$D_{ij}=\sigma_{ij}.$$
The sum of the diagonal entries, the trace of $D$, is the total variance of the data.
Proof: This follows from the definitions, but it’s worth checking the details, which we leave as an exercise.
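For readers who want to see the proposition in action, here is a small illustrative computation, using only the three rows of table 1 that are shown explicitly (so the numbers mean nothing statistically).

```python
import numpy as np

# Rows = samples (A, B, U from table 1), columns = Ht, Wgt, Bp, Age.
X = np.array([
    [180.0, 75.0, 110.0, 35.0],
    [193.0, 80.0, 130.0, 40.0],
    [150.0, 92.0, 105.0, 55.0],
])
N = X.shape[0]

X0 = X - X.mean(axis=0)            # center each column
D = X0.T @ X0 / (N - 1)            # sample covariance matrix

# Diagonal entries are the column variances; off-diagonal entries are
# covariances; the trace is the total variance.
print(np.allclose(np.diag(D), X.var(axis=0, ddof=1)))
print(np.allclose(D, np.cov(X, rowvar=False)))
print("total variance:", np.trace(D))
```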
1.2.5 Visualizing the covariance matrix
If the number of features in the data is not too large, a density matrix plot provides a tool for visualizing the covariance matrix of the data. A density matrix plot is a $k\times k$ grid of plots (where $k$ is the number of features). The entry with coordinates $(i,j)$ in the grid is a scatter plot of the $i^{th}$ feature against the $j^{th}$ one if $i\neq j$, and is a histogram of the $i^{th}$ variable if $i=j$.
Figure 2 is an example of a density matrix plot for a dataset with two features. This data has been centered, so it can be represented by a centered data matrix $X_0$. The diagonal graphs (upper left and lower right) are the histograms of the two columns, while the off-diagonal graphs are scatter plots of one column against the other.
Figure 2: Density Matrix Plot
1.2.6 Linear Combinations of Features (Scores)
Sometimes useful information about our data can be revealed if we combine different measurements together to obtain a "hybrid" measure that captures something interesting. For example, in the Auto MPG dataset that we studied in the section on Linear Regression, we looked at the influence of both vehicle weight and engine displacement on gas mileage; perhaps there is some value in considering a hybrid "score" defined as $a\cdot(\text{weight})+b\cdot(\text{displacement})$ for some constants $a$ and $b$ – maybe by choosing a good combination we could find a better predictor of gas mileage than using one or the other of the features individually.
As another example, suppose we are interested in the impact of the nutritional content of food on weight gain in a study. We know that both calorie content and the level of dietary fiber contribute to the weight gain of participants eating this particular food; maybe there is some kind of combined "calorie/fiber" score we could introduce that captures the impact of that food better.
Definition: Let $X_0$ be a (centered) $N\times k$ data matrix giving information about $k$ features for each of $N$ samples. A linear synthetic feature, or a linear score, is a linear combination of the $k$ features. The linear score defined by constants $w_1,\ldots,w_k$ is $$Z=w_1X_1+w_2X_2+\cdots+w_kX_k.$$ If $x_1,\ldots,x_k$ are the values of the features for a particular sample, then the value of the linear score for that sample is $$z=w_1x_1+w_2x_2+\cdots+w_kx_k.$$
Lemma: The values $z=(z_1,\ldots,z_N)$ of the linear score for each of the $N$ samples can be calculated by the matrix multiplication
$$z=X_0w, \tag{5}$$
where $w$ is the column vector with entries $w_1,\ldots,w_k$.
Proof: Multiplying a matrix by a column vector computes a linear combination of the columns – that’s what this lemma says. Exercise 3 asks you to write out the indices and make sure you believe this.
1.2.7 Mean and variance of scores
When we combine features to make a hybrid score, we assume that the features were centered to begin with, so that each feature has mean zero. As a result, the mean of the hybrid feature is again zero.
Lemma: A linear combination of features with mean zero again has mean zero.
Proof: Let $z_i$ be the score for the $i^{th}$ sample, so $z_i=\sum_{j=1}^{k}w_jx_{ij}$, where $X_0$ has entries $x_{ij}$. Then the mean value of the score is $$\bar{z}=\frac{1}{N}\sum_{i=1}^{N}z_i=\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{k}w_jx_{ij}.$$ Reversing the order of the sum yields $$\bar{z}=\sum_{j=1}^{k}w_j\left(\frac{1}{N}\sum_{i=1}^{N}x_{ij}\right)=\sum_{j=1}^{k}w_j\bar{x}_j=0,$$ where $\bar{x}_j=0$ is the mean of the $j^{th}$ feature (column) of $X_0$.
The variance is more interesting, and gives us an opportunity to put the covariance matrix to work. Remember from eq. 2 that, since a score has mean zero, its variance is $$\sigma_z^{2}=\frac{1}{N-1}z^{T}z,$$ where here the score is represented by the column vector $z$ with entries $z_1,\ldots,z_N$ as in eq. 5.
Lemma: The variance of the score $z$ with weights $w$ is $$\sigma_z^{2}=w^{T}Dw. \tag{6}$$ More generally, if $z$ and $z'$ are scores with weights $w$ and $w'$ respectively, then the covariance of $z$ and $z'$ is $$\sigma_{zz'}=w^{T}Dw'.$$
Proof: From eq. 2 and eq. 5 we know that $$\sigma_z^{2}=\frac{1}{N-1}z^{T}z \quad\text{and}\quad z=X_0w.$$ Since $z^{T}z=(X_0w)^{T}(X_0w)=w^{T}X_0^{T}X_0w$, this gives us $$\sigma_z^{2}=w^{T}\left(\frac{1}{N-1}X_0^{T}X_0\right)w=w^{T}Dw,$$ as claimed.
For the covariance, use a similar argument with eq. 3 and eq. 5, writing $$\sigma_{zz'}=\frac{1}{N-1}z^{T}z'$$ and using the fact that $z$ and $z'$ can be written as $X_0w$ and $X_0w'$.
The point of this lemma is that the covariance matrix contains not just the variances and covariances of the original features, but also enough information to construct the variances and covariances for any linear combination of features.
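A quick numerical check of eq. 6, with made-up data and weights, under the conventions used here ($X_0$ centered, $D=\frac{1}{N-1}X_0^{T}X_0$):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))      # made-up data: 100 samples, 3 features
X0 = X - X.mean(axis=0)            # centered data matrix
N = X0.shape[0]
D = X0.T @ X0 / (N - 1)            # covariance matrix

w = np.array([0.5, -1.0, 2.0])     # weights defining a linear score
z = X0 @ w                         # eq. (5): values of the score

# Variance computed directly from z agrees with w^T D w (eq. 6).
print(np.var(z, ddof=1), w @ D @ w)
```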
In the next section we will see how to exploit this idea to reveal hidden structure in our data.
1.2.8 Geometry of Scores
Let’s return to the dataset that we looked at in section 1.2.5. In fig. 3 we simplify its density matrix plot, showing just one of the scatter plots and the two histograms.
The scatter plot shows that the data points are arranged in a more or less elliptical cloud oriented at an angle to the coordinate axes, which represent the two given features. The two individual histograms show the distribution of the two features – each has mean zero and is spread out over a range of values on either side of it. Looking just at the two features individually, meaning only at the two histograms, we can’t see the overall elliptical structure.
Figure 3: Simulated Data with Two Features
How can we get a better grip on our data in this situation? We can try to find a “direction” in our data that better illuminates the variation of the data. For example, suppose that we pick a unit vector at the origin pointing in a particular direction in our data. See fig. 4.
Figure 4: A direction in the data
Now we can orthogonally project the datapoints onto the line defined by this vector, as shown in fig. 5.
Figure 5: Projecting the datapoints
Recall that if the unit vector $u$ is defined by coordinates $(u_1,u_2)$, then the orthogonal projection of the point $x$ with coordinates $(x_1,x_2)$ onto the line through $u$ is $(x\cdot u)\,u$. Now $$x\cdot u=u_1x_1+u_2x_2,$$ so the coordinates of the points along the line defined by $u$ are the values of the score defined by the weights $u_1,u_2$. Using our work in the previous section, we see that we can find all of these coordinates by matrix multiplication: $$z=X_0u,$$ where $X_0$ is our (centered) data matrix. Now let’s add a histogram of the values of $z$ to our picture:
Figure 6: Distribution of Z
This histogram shows the distribution of the values of $z$ along the tilted line defined by the unit vector $u$.
Finally, using our work on the covariance matrix, we see that the variance of $z$ is given by $$\sigma_z^{2}=u^{T}Du,$$ where $D$ is the covariance matrix of the data $X_0$.
Lemma: Let $X_0$ be a centered $N\times k$ data matrix, and let $D$ be the associated covariance matrix. Let $u$ be a unit vector in "feature space" $\mathbb{R}^{k}$. Then the score $z=X_0u$ can be interpreted as the coordinates of the points of $X_0$ projected onto the line generated by $u$. The variance of this score is $$\sigma_z^{2}=u^{T}Du,$$ where the entry $z_i$ of the score is the dot product of the $i^{th}$ row of $X_0$ with $u$. It measures the variability in the data "in the direction of the unit vector $u$."
1.3 Principal Components
1.3.1 Change of variance with direction
As we’ve seen in the previous section, if we choose a unit vector $u$ in the feature space $\mathbb{R}^{k}$ and find the projection of our data onto the line through $u$, we get a "score" that we can use to measure the variance of the data in the direction of $u$. What happens as we vary $u$?
To study this question, let’s continue with our simulated data from the previous section, and introduce a unit vector $$u(\theta)=(\cos\theta,\sin\theta).$$ This is in fact a unit vector, since $\cos^{2}\theta+\sin^{2}\theta=1$, and it is oriented at an angle $\theta$ from the horizontal axis.
The variance of the data in the direction of $u(\theta)$ is given by $$\sigma^{2}(\theta)=u(\theta)^{T}D\,u(\theta).$$
A plot of this function for the data we have been considering is in fig. 7. As you can see, the variance goes through two full periods as $\theta$ runs from $0$ to $2\pi$, and it reaches its maximum and minimum values at angles that differ by $\pi/2$ – so the two directions where the variance is maximum and minimum are orthogonal to one another.
Figure 7: Change of variance with angle theta
The two directions where the variance is maximum and minimum are drawn on the original data scatter plot in fig. 8 .
Figure 8: Data with principal directions
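The curve in fig. 7 and the orthogonal extremal directions are easy to reproduce on simulated two-feature data of one's own; the sketch below does not use the text's dataset, just a made-up elliptical cloud.

```python
import numpy as np

# Simulated two-feature data with an elliptical shape (not the text's dataset).
rng = np.random.default_rng(1)
X0 = rng.multivariate_normal([0, 0], [[3.0, 1.2], [1.2, 1.0]], size=500)
X0 = X0 - X0.mean(axis=0)
D = X0.T @ X0 / (len(X0) - 1)

# Variance of the score in the direction u(theta) = (cos t, sin t).
thetas = np.linspace(0, 2 * np.pi, 361)
variances = np.array([np.array([np.cos(t), np.sin(t)]) @ D @
                      np.array([np.cos(t), np.sin(t)]) for t in thetas])

# The curve has period pi, and its max/min directions are 90 degrees apart;
# the extreme values match the eigenvalues of D.
print(variances.max(), variances.min())
print(np.linalg.eigvalsh(D))       # smallest and largest eigenvalue
```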
Let’s try to understand why this is happening.
1.3.2 Directions of extremal variance
Given our centered $N\times k$ data matrix $X_0$, with its associated covariance matrix $D$, we would like to find unit vectors $u$ in $\mathbb{R}^{k}$ so that $\sigma^{2}(u)=u^{T}Du$ reaches its maximum and its minimum. Here $\sigma^{2}(u)$ is the variance of the "linear score" $z=X_0u$, and it represents how dispersed the data is in the "$u$ direction" in $\mathbb{R}^{k}$.
In this problem, remember that the coordinates $u_1,\ldots,u_k$ of $u$ are the variables and the symmetric matrix $D$ is given. As usual, to find the maximum and minimum values of $\sigma^{2}(u)$, we should look at the partial derivatives of $\sigma^{2}(u)$ with respect to the variables $u_i$ and set them to zero. Here, however, there is a catch – we want to restrict $u$ to being a unit vector, with $u^{T}u=1$.
So this is a constrained optimization problem:
Find extreme values of the function $\sigma^{2}(u)=u^{T}Du$
Subject to the constraint $u^{T}u=1$ (or $u^{T}u-1=0$)
We will use the technique of Lagrange Multipliers to solve such a problem.
To apply this method, we introduce the function
$$S(u,\lambda)=u^{T}Du-\lambda(u^{T}u-1). \tag{7}$$
Then we compute the gradient
$$\nabla S=\begin{pmatrix}\partial S/\partial u_1\\ \vdots\\ \partial S/\partial u_k\\ \partial S/\partial\lambda\end{pmatrix} \tag{8}$$
and solve the system of equations $\nabla S=0$. Here we have written the gradient as a column vector for reasons that will become clearer shortly.
Computing all of these partial derivatives looks messy, but actually if we take advantage of matrix algebra it’s not too bad. The following two lemmas explain how to do this.
Lemma: Let $A$ be an $m\times k$ matrix with constant coefficients and let $u$ be a column vector whose entries are $u_1,\ldots,u_k$. The function $f(u)=Au$ is a linear map from $\mathbb{R}^{k}\to\mathbb{R}^{m}$. Its (total) derivative is a linear map between the same vector spaces, and satisfies $$Df(u)(v)=Av$$ for any vector $v$. If $A$ is a $1\times k$ matrix, so that $f(u)=Au$ is a scalar, then $$\nabla f(u)=A^{T}$$ for any vector $u$. (This is the matrix version of the derivative rule that $\frac{d}{dx}(ax)=a$ for a constant $a$.)
Proof: Since $f(u)=Au$, we can write $f$ out in more traditional function notation as $f(u)=(f_1(u),\ldots,f_m(u))$, where $$f_i(u)=\sum_{j=1}^{k}a_{ij}u_j.$$ Thus $\partial f_i/\partial u_j=a_{ij}$. The total derivative is the linear map with matrix $[\partial f_i/\partial u_j]=A$, and so $Df(u)(v)=Av$.
The other result is proved the same way.
Lemma: Let $D$ be a $k\times k$ symmetric matrix with constant entries and let $u$ be a $k\times 1$ column vector of variables $u_1,\ldots,u_k$. Let $g$ be the function $g(u)=u^{T}Du$. Then the gradient $\nabla g$ is a vector field – that is, a vector-valued function of $u$ – and is given by the formula $$\nabla g(u)=2Du.$$
Proof: Let $d_{ij}$ be the $(i,j)$ entry of $D$. We can write out the function $g$ to obtain $$g(u)=u^{T}Du=\sum_{i=1}^{k}\sum_{j=1}^{k}d_{ij}u_iu_j.$$ Now $\partial/\partial u_m$ is going to pick out only terms where $u_m$ appears, yielding: $$\frac{\partial g}{\partial u_m}=\sum_{j=1}^{k}d_{mj}u_j+\sum_{i=1}^{k}d_{im}u_i.$$ Here the first sum catches all of the terms where the first "u" is $u_m$, and the second sum catches all the terms where the second "u" is $u_m$. The diagonal term $d_{mm}u_m^{2}$ contributes once to each sum, which is consistent with the rule that the derivative of $d_{mm}u_m^{2}$ is $2d_{mm}u_m$. To finish the proof, notice that $d_{im}=d_{mi}$ since $D$ is symmetric, so in fact the two sums are the same. Thus $$\frac{\partial g}{\partial u_m}=2\sum_{j=1}^{k}d_{mj}u_j.$$ But the right-hand side of this equation is twice the $m^{th}$ entry of the vector $Du$, so putting the results together we get $$\nabla g(u)=2Du.$$
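If you want to convince yourself of this formula numerically, here is a small finite-difference check on a randomly generated symmetric matrix (a sketch, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(4, 4))
D = (A + A.T) / 2                  # a random symmetric matrix
u = rng.normal(size=4)

def g(v):
    return v @ D @ v               # the quadratic form g(u) = u^T D u

# Central-difference approximation of the gradient of g at u.
h = 1e-6
grad_fd = np.array([(g(u + h * e) - g(u - h * e)) / (2 * h)
                    for e in np.eye(4)])

print(np.allclose(grad_fd, 2 * D @ u, atol=1e-5))   # gradient is 2 D u
```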
The following theorem puts all of this work together and reduces our question about how the variance changes with direction to a question about the eigenvalues and eigenvectors of the covariance matrix.
1.3.3 Critical values of the variance
Theorem: The critical values of the variance $\sigma^{2}(u)=u^{T}Du$, as $u$ varies over unit vectors in $\mathbb{R}^{k}$, are the eigenvalues $\lambda$ of the covariance matrix $D$, and if $u$ is a unit eigenvector corresponding to $\lambda$, then $\sigma^{2}(u)=\lambda$.
Proof: Recall that we introduced the Lagrange function $S(u,\lambda)$, whose critical points give us the solutions to our constrained optimization problem. As we said in eq. 7: $$S(u,\lambda)=u^{T}Du-\lambda(u^{T}u-1).$$ Now apply our matrix calculus lemmas. First, let’s treat $\lambda$ as a constant and focus on the variables $u_1,\ldots,u_k$. We can write $u^{T}u=u^{T}Iu$, where $I$ is the identity matrix, to compute: $$\nabla_{u}S=2Du-2\lambda u.$$ For $\lambda$ we have $$\frac{\partial S}{\partial\lambda}=-(u^{T}u-1).$$ The critical points occur when $$2Du-2\lambda u=0 \quad\text{and}\quad u^{T}u-1=0.$$ The first equation says that $\lambda$ must be an eigenvalue and $u$ an eigenvector: $$Du=\lambda u,$$ while the second says $u$ must be a unit vector: $u^{T}u=1$. The second part of the result follows from the fact that if $u$ is a unit eigenvector with eigenvalue $\lambda$, then $$\sigma^{2}(u)=u^{T}Du=u^{T}(\lambda u)=\lambda\,u^{T}u=\lambda.$$
To really make this result pay off, we need to recall some key facts about the eigenvalues and eigenvectors of symmetric matrices. Because these facts are so central to this result, and to other applications throughout machine learning and mathematics generally, we provide proofs in section 1.5.
Table 2: Properties of Eigenvalues of Real Symmetric Matrices
Summary
1. All of the eigenvalues $\lambda_1,\ldots,\lambda_k$ of a real symmetric matrix $A$ are real. If $x^{T}Ax\ge 0$ for all $x$, then all eigenvalues are non-negative. In the latter case we say that $A$ is positive semi-definite.
2. If $u$ is an eigenvector for $A$ with eigenvalue $\lambda$, and $u'$ is an eigenvector with a different eigenvalue $\lambda'$, then $u$ and $u'$ are orthogonal: $u\cdot u'=0$.
3. There is an orthonormal basis $u_1,\ldots,u_k$ of $\mathbb{R}^{k}$ made up of eigenvectors of $A$ corresponding to the eigenvalues $\lambda_1,\ldots,\lambda_k$.
4. Let $\Lambda$ be the diagonal matrix with diagonal entries $\lambda_1,\ldots,\lambda_k$ and let $P$ be the matrix whose columns are made up of the vectors $u_1,\ldots,u_k$. Then $A=P\Lambda P^{T}$.
If we combine this theorem with the facts summarized in table 2 then we get a complete picture. Let $D$ be the covariance matrix of our data. Since $u^{T}Du=\sigma^{2}(u)\ge 0$ for every $u$, we know that the eigenvalues $\lambda_1\ge\lambda_2\ge\cdots\ge\lambda_k$ are all nonnegative. Choose a corresponding sequence of orthogonal eigenvectors $u_1,\ldots,u_k$, where all $\|u_i\|=1$. Since the $u_i$ form a basis of $\mathbb{R}^{k}$, any score weight vector $u$ is a linear combination of the $u_i$: $$u=\sum_{i=1}^{k}c_iu_i.$$ Since $u_i^{T}Du_j=\lambda_ju_i^{T}u_j$ is zero unless $i=j$, in which case it is $\lambda_i$, we can compute $$\sigma^{2}(u)=u^{T}Du=\sum_{i=1}^{k}c_i^{2}\lambda_i,$$ and $$u^{T}u=\sum_{i=1}^{k}c_i^{2},$$ since the $u_i$ are an orthonormal set. So in these coordinates, our optimization problem is:
maximize $\sum_{i=1}^{k}c_i^{2}\lambda_i$
subject to the constraint $\sum_{i=1}^{k}c_i^{2}=1$.
We don’t need any fancy math to see that the maximum happens when $c_1=1$ and the other $c_i=0$, and in that case the maximum is $\lambda_1$, the largest eigenvalue. (If $\lambda_1$ occurs more than once, there may be a whole subspace of directions where the variance is maximal.) Similarly, the minimum value is $\lambda_k$, the smallest eigenvalue, and occurs when $c_k=1$ and the others are zero.
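Here is a brief numerical illustration of this conclusion: no unit vector produces a variance larger than the largest eigenvalue of $D$ or smaller than the smallest one. The data is random and the sketch assumes NumPy's `eigh`, which returns eigenvalues in ascending order.

```python
import numpy as np

rng = np.random.default_rng(3)
X0 = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # made-up data
X0 = X0 - X0.mean(axis=0)
D = X0.T @ X0 / (len(X0) - 1)

evals, evecs = np.linalg.eigh(D)   # ascending eigenvalues, orthonormal vectors

# Sample many random unit vectors; none beats the top eigenvalue.
U = rng.normal(size=(10000, 5))
U = U / np.linalg.norm(U, axis=1, keepdims=True)
variances = np.einsum('ij,jk,ik->i', U, D, U)   # u^T D u for each row u

print(variances.max() <= evals[-1] + 1e-9)   # lambda_max bounds the variance
print(variances.min() >= evals[0] - 1e-9)    # lambda_min bounds it below
print(evecs[:, -1] @ D @ evecs[:, -1], evals[-1])  # attained at the eigenvector
```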
1.3.4 Subspaces of extremal variance
We can generalize the idea of the variance of our data in a particular direction to a higher-dimensional notion of the total variance in a subspace. Suppose that $E$ is a subspace of $\mathbb{R}^{k}$ of dimension $s$ and $U$ is a $k\times s$ matrix whose columns span $E$ – the columns of $U$ are the weights of a family of scores that span $E$. The values of these scores are $X_0U$, and the covariance matrix of this projected data is $$D_E=\frac{1}{N-1}(X_0U)^{T}(X_0U)=U^{T}DU.$$
Finally, the total variance of the data projected into $E$ is the sum of the diagonal entries of this matrix: $$\sigma_E^{2}=\operatorname{trace}(U^{T}DU).$$
Just as the variance in a given direction depends on the scaling of $u$, the variance in a subspace depends on the scaling of the columns of $U$. To normalize this scaling, we assume that the columns of $U$ are an orthonormal basis of the subspace $E$.
Now we can generalize the question asked in section 1.3.2 by seeking, not just a vector pointing in the direction of extremal variance, but instead the subspace $E$ of dimension $s$ with the property that the total variance of the projection of the data into $E$ is maximal compared to its projection into other subspaces of that dimension.
Theorem: Let $u_1,\ldots,u_k$ be the orthonormal eigenvectors of $D$ with eigenvalues $\lambda_1\ge\cdots\ge\lambda_k$. Given $s\le k$, choose $s$ of the largest eigenvalues of $D$ and let $U$ be the $k\times s$ matrix whose columns are the corresponding eigenvectors. (There may be several choices here, since some of the eigenvalues may be the same.) The subspaces $E$ constructed in this way have the largest total variance among all subspaces of dimension $s$, and that total variance is $\lambda_1+\cdots+\lambda_s$. These are called subspaces of extremal variance.
Overlooking the issue of repeated eigenvalues, this can be phrased more concisely by saying that the eigenvectors corresponding to the $s$ largest eigenvalues span the subspace of dimension $s$ such that the projection of the data into that subspace has maximal total variance among all subspaces of dimension $s$.
Proof: To make this concrete, suppose we consider a subspace $E$ of $\mathbb{R}^{k}$ of dimension $s$ with basis $w_1,\ldots,w_s$. Complete this to a basis of $\mathbb{R}^{k}$ and then apply the Gram-Schmidt Process (see section 1.5.1) to find an orthonormal basis $v_1,\ldots,v_k$ where $v_1,\ldots,v_s$ are an orthonormal basis for $E$. Let $U$ be the $k\times s$ matrix whose columns are the $v_j$ for $j=1,\ldots,s$. The rows of the matrix $X_0U$ give the coordinates of the projection of each sample into the subspace $E$, expressed in terms of the scores corresponding to these vectors $v_j$. The total variance of these projections is $$\sigma_E^{2}=\operatorname{trace}(U^{T}DU)=\sum_{j=1}^{s}v_j^{T}Dv_j.$$
If we want to maximize this, we have the constrained optimization problem of finding vectors $v_1,\ldots,v_s$ so that
$\sum_{j=1}^{s}v_j^{T}Dv_j$ is maximal,
subject to the constraint that each $v_j$ has $v_j^{T}v_j=1$,
and that the $v_j$ are orthogonal, meaning $v_i^{T}v_j=0$ for $i\neq j$,
and that the $v_j$ are linearly independent.
Then the span of these $v_j$ is a subspace of extremal variance.
Theorem: An $s$-dimensional subspace $E$ is a subspace of extremal variance if and only if it is spanned by $s$ orthonormal eigenvectors of the matrix $D$ corresponding to $s$ of the largest eigenvalues of $D$.
Proof: We can approach this problem using Lagrange multipliers and matrix calculus if we are careful. Our unknown is a $k\times s$ matrix $U$ whose columns are the (unknown) vectors $v_1,\ldots,v_s$. The objective function that we are seeking to maximize is $$\operatorname{trace}(U^{T}DU)=\sum_{j=1}^{s}v_j^{T}Dv_j.$$ The constraints are the requirements that $v_j^{T}v_j=1$ and $v_i^{T}v_j=0$ if $i\neq j$. If we introduce a symmetric $s\times s$ matrix of Lagrange multipliers $\Lambda=[\lambda_{ij}]$, where $\lambda_{ij}$ is the multiplier that goes with the first of these constraints when $i=j$, and the second when $i\neq j$, we can express our Lagrange function as: $$S(U,\Lambda)=\operatorname{trace}(U^{T}DU)-\operatorname{trace}\left(\Lambda(U^{T}U-I)\right),$$ where $I$ is the $s\times s$ identity matrix.
Taking the derivatives with respect to the entries of $U$ and of $\Lambda$ yields the following two equations: $$DU=U\Lambda \quad\text{and}\quad U^{T}U=I.$$
The first of these equations says that the space spanned by the columns of $U$ is invariant under $D$, while the second says that the columns of $U$ form an orthonormal basis for that space.
Let’s assume for the moment that we have a matrix $U$ that satisfies these conditions.
Then it must be the case that $\Lambda$ is a symmetric, real-valued matrix, since $$U^{T}DU=U^{T}U\Lambda=\Lambda,$$ and the matrix on the left is symmetric.
By the properties of real symmetric matrices (the spectral theorem), there are orthonormal vectors $q_1,\ldots,q_s$ that are eigenvectors of $\Lambda$ with corresponding eigenvalues $\mu_1,\ldots,\mu_s$. If we let $Q$ be the $s\times s$ matrix whose columns are the vectors $q_j$ and let $M$ be the diagonal matrix whose entries are the $\mu_j$, we have $$\Lambda=QMQ^{T}.$$
If we go back to our original equations, we see that if $U$ exists such that $DU=U\Lambda$ and $U^{T}U=I$, then the matrix $U'=UQ$ has orthonormal columns and satisfies $$DU'=U\Lambda Q=UQMQ^{T}Q=U'M.$$ In other words, $U'$ is a matrix whose columns are eigenvectors of $D$ with eigenvalues $\mu_j$ for $j=1,\ldots,s$, and its columns span the same subspace $E$ as the columns of $U$.
Thus we see how to construct an invariant subspace $E$ and a solution matrix $U$. Such an $E$ is spanned by $s$ orthonormal eigenvectors of $D$, with eigenvalues drawn from $\lambda_1,\ldots,\lambda_k$; and $U$ is the matrix whose columns are those eigenvectors. Further, in that case, the total variance associated to $E$ is the sum of the chosen eigenvalues; to make this as large as possible, we should choose our eigenvectors to correspond to $s$ of the largest eigenvalues of $D$. This concludes the proof.
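The theorem can be illustrated numerically: the trace of $U^{T}DU$ for the top-$s$ eigenvectors equals the sum of the $s$ largest eigenvalues, and a randomly chosen orthonormal basis of an $s$-dimensional subspace never does better. A sketch on random data (not a proof):

```python
import numpy as np

rng = np.random.default_rng(4)
X0 = rng.normal(size=(300, 6)) @ rng.normal(size=(6, 6))
X0 = X0 - X0.mean(axis=0)
D = X0.T @ X0 / (len(X0) - 1)

evals, evecs = np.linalg.eigh(D)     # ascending order
s = 2
U_best = evecs[:, -s:]               # eigenvectors for the s largest eigenvalues

# A random s-dimensional subspace, with an orthonormal basis from QR.
Q, _ = np.linalg.qr(rng.normal(size=(6, s)))

print(np.trace(U_best.T @ D @ U_best), evals[-s:].sum())  # equal
print(np.trace(Q.T @ D @ Q) <= evals[-s:].sum() + 1e-9)   # never larger
```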
1.3.5 Definition of Principal Components
Definition: The orthonormal unit eigenvectors $u_1,\ldots,u_k$ of $D$ are the principal directions or principal components for the data $X_0$.
Theorem: The maximum variance occurs in the principal direction(s) associated to the largest eigenvalue, and the minimum variance in the principal direction(s) associated with the smallest one. The covariance between scores in principal directions associated with different eigenvalues is zero.
At this point, the picture in fig. 8 makes sense – the red and green dashed lines are the principal directions, they are orthogonal to one another, and the point in the directions where the data is most (and least) “spread out.”
Proof: The statement about the largest and smallest eigenvalues was proved at the end of section 1.3.3. The covariance of two scores corresponding to different eigenvectors $u_i$ and $u_j$ is $$\sigma_{z_iz_j}=u_i^{T}Du_j=\lambda_ju_i^{T}u_j=0,$$ since the $u_i$ and $u_j$ are orthogonal.
Sometimes the results above are presented in a slightly different form, and may be referred to, in part, as Rayleigh’s theorem.
Corollary: (Rayleigh’s Theorem) Let $A$ be a real symmetric matrix and let $$M=\max_{x\neq 0}\frac{x^{T}Ax}{x^{T}x}.$$ Then $M$ is the largest eigenvalue of $A$. (Similarly, if we replace $\max$ by $\min$, then the minimum is the least eigenvalue.)
Proof: The ratio $\frac{x^{T}Ax}{x^{T}x}$ is unchanged if $x$ is rescaled, so maximizing it is the same as maximizing $u^{T}Au$ over unit vectors $u$ – which is the same optimization problem that we considered above.
Exercises.
1. Prove that the two expressions for the variance $\sigma_x^{2}$ given in section 1.2.1 (eq. 1 and eq. 2) are the same.
2. Prove that the covariance matrix is as described in the proposition in section 1.2.4.
3. Let $X_0$ be an $N\times k$ matrix with entries $x_{ij}$ for $i=1,\ldots,N$ and $j=1,\ldots,k$. If a linear score is defined by the constants $w_1,\ldots,w_k$, check that equation eq. 5 holds as claimed.
4. Why is it important to use a unit vector $u$ when computing the variance of the data in the direction of $u$? Suppose $v=cu$, where $u$ is a unit vector and $c$ is a constant. Let $z'$ be the score $X_0v$. How is the variance of $z'$ related to that of the score $z=X_0u$?
1.4 Dimensionality Reduction via Principal Components
The principal components associated with a dataset separate out directions in the feature space in which the data is most (or least) variable. One of the main applications of this information is to enable us to take data with a great many features – a set of points in a high dimensional space – and, by focusing our attention on the scores corresponding to the principal directions, capture most of the information in the data in a much lower dimensional setting.
To illustrate how this is done, let $X$ be an $N\times k$ data matrix, let $X_0$ be its centered version, and let $D$ be the associated covariance matrix.
Apply the spectral theorem (described in table 2 and proved in section 1.5) to the covariance matrix $D$ to obtain eigenvalues $\lambda_1\ge\lambda_2\ge\cdots\ge\lambda_k\ge 0$ and associated orthonormal eigenvectors $u_1,\ldots,u_k$. The scores $z_j=X_0u_j$ give the values of the data in the principal directions. The variance of $z_j$ is $\lambda_j$.
Now choose a number $s<k$ and consider the scores $z_1,\ldots,z_s$. The $i^{th}$ entry in $z_j$ is the value of the score $z_j$ for the $i^{th}$ data point. Because $z_1,\ldots,z_s$ capture the most significant variability in the original data, we can learn a lot about our data by considering just these $s$ features of the data, instead of needing all $k$.
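The whole recipe fits in a few lines of code. The sketch below is illustrative only: the helper `pca_scores` and the clustered synthetic data are inventions of this example, not the text's dataset.

```python
import numpy as np

def pca_scores(X, s):
    """Project the rows of X onto its first s principal directions."""
    X0 = X - X.mean(axis=0)                    # center the data
    D = X0.T @ X0 / (X0.shape[0] - 1)          # covariance matrix
    evals, evecs = np.linalg.eigh(D)           # ascending eigenvalues
    order = np.argsort(evals)[::-1]            # sort descending
    evals, evecs = evals[order], evecs[:, order]
    scores = X0 @ evecs[:, :s]                 # N x s matrix of scores
    return scores, evals, evecs

# Made-up data: 200 samples of 15 features with a few hidden clusters.
rng = np.random.default_rng(5)
centers = rng.normal(scale=3.0, size=(5, 15))
X = np.vstack([c + rng.normal(size=(40, 15)) for c in centers])

scores, evals, evecs = pca_scores(X, s=2)
print(scores.shape)        # (200, 2): two coordinates per sample
print(evals[:4])           # the leading eigenvalues (variances of the scores)
```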
To illustrate, let’s look at an example. We begin with a synthetic dataset which has $200$ samples and $15$ features. The data (some of it) for some of the samples is shown in table 3.
Table 3: Simulated Data for PCA Analysis
        f-0     f-1     f-2     f-3     f-4     ...     f-10    f-11    f-12    f-13    f-14
s-0     1.18    -0.41   2.02    0.44    2.24    ...     0.32    0.95    0.88    1.10    0.89
s-1     0.74    0.58    1.54    0.23    2.05    ...     0.99    1.14    1.56    0.99    0.59
...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...
s-198   1.04    2.02    1.44    0.40    1.33    ...     0.62    0.62    0.54    1.96    0.04
s-199   0.92    2.09    1.58    1.19    1.17    ...     0.42    0.85    0.83    2.22    0.90
The full dataset is a $200\times 15$ matrix; it has $3000$ numbers in it and we’re not really equipped to make sense of it. We could try some graphing – for example, fig. 9 shows a scatter plot of two of the features plotted against each other.
Figure 9: Scatter Plot of Two Features
Unfortunately there’s not much to see in fig. 9 – just a blob – because the individual features of the data don’t tell us much in isolation; whatever structure there is in this data arises out of the relationships between different features.
In fig. 10 we show a "density grid" plot of the data. The graph in position $(i,j)$ shows a scatter plot of the $i^{th}$ and $j^{th}$ columns of the data, except in the diagonal positions, where in position $(i,i)$ we plot a histogram of column $i$. There’s not much structure visible; it is just a lot of blobs.
Figure 10: Density Grid Plot of All Features
So let’s apply the theory of principal components. We use a software package to compute the eigenvalues and eigenvectors of the covariance matrix $D$. The eigenvalues are plotted, in descending order, in fig. 11.
Figure 11: Eigenvalues of the Covariance Matrix
This plot shows that the first four eigenvalues are relatively large, while the remaining ones are smaller and not very different from each other. We interpret this as saying that most of the variation in the data is accounted for by the first four principal components. We can even make this quantitative: the total variance of the data is the sum of the eigenvalues of the covariance matrix – the trace of $D$ – and the fraction of the total variance accounted for by the first four principal components is the sum of the first four eigenvalues divided by that total, which in this example is the great majority of the variation of the data.
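The bookkeeping for "fraction of variance explained" looks like this in code; the eigenvalues below are placeholders chosen only to mimic the shape of fig. 11, not the actual values from this dataset.

```python
import numpy as np

# Placeholder eigenvalues of a covariance matrix, in descending order
# (made up for illustration: four large values, then a tail of small ones).
evals = np.array([5.2, 3.9, 2.7, 2.1, 0.4, 0.35, 0.3, 0.28,
                  0.25, 0.22, 0.2, 0.18, 0.15, 0.12, 0.1])

total_variance = evals.sum()                  # trace of the covariance matrix
explained = evals[:4].sum() / total_variance  # fraction from the first four

print(total_variance, explained)
print(np.cumsum(evals) / total_variance)      # cumulative fraction, one PC at a time
```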
Now let’s focus in on the two largest eigenvalues $\lambda_1$ and $\lambda_2$ and their corresponding eigenvectors $u_1$ and $u_2$. The column vectors $z_1=X_0u_1$ and $z_2=X_0u_2$ are the values of the scores associated with these two eigenvectors. So for each data point (each row of $X_0$) we have two values (the corresponding entries of $z_1$ and $z_2$). In fig. 12 we show a scatter plot of these scores.
Figure 12: Scatter Plot of Scores in the First Two Principal Directions
Notice that suddenly some structure emerges in our data! We can see that the 200 points are separated into five clusters, distinguished by the values of their scores! This ability to find hidden structure in complicated data is one of the most important applications of principal components.
If we were dealing with real data, we would now want to investigate the different groups of points to see if we can understand what characteristics the principal components have identified.
1.4.1 Loadings
There’s one last piece of the PCA puzzle that we are going to investigate. In fig. 12, we plotted our data points in the coordinates given by the first two principal components. In geometric terms, we took the cloud of points in given by the rows of and projected those points into the two dimensional plane spanned by the eigenvectors and , and then plotted the distribution of the points in that plane.
More generally, suppose we take our dataset $X_0$ and consider the first $s$ principal components corresponding to the eigenvectors $u_1,\ldots,u_s$. The projection of the data into the space spanned by these eigenvectors is represented by the $N\times s$ matrix $X_0U$, where $U$ is the $k\times s$ matrix whose columns are the eigenvectors $u_1,\ldots,u_s$. Each row of $X_0U$ gives the values of the scores for one sample; the $j^{th}$ column holds the score arising from $u_j$, for $j=1,\ldots,s$.
The remaining question that we wish to consider is: how can we see some evidence of the original features in this subspace? We can answer this by imagining that we had an artificial sample that has a measurement of $1$ for the $j^{th}$ feature and a measurement of zero for all the other features. The corresponding point is represented by a row vector $e_j$ with a $1$ in position $j$ and zeros elsewhere. The projection of this synthetic sample into the span of the first $s$ principal components is the row vector $e_jU$. Notice, however, that $e_jU$ is just the $j^{th}$ row of the matrix $U$. This vector in the space spanned by the $u_1,\ldots,u_s$ is called the "loading" of the $j^{th}$ feature in the principal components.
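In code, the loadings are literally the rows of $U$. A sketch with stand-in data (the feature names `f-0`, `f-1`, ... follow table 3, but the numbers here are random):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 15))               # stand-in data, not table 3
X0 = X - X.mean(axis=0)
D = X0.T @ X0 / (X0.shape[0] - 1)

evals, evecs = np.linalg.eigh(D)
U = evecs[:, ::-1][:, :2]                    # first two principal directions

scores = X0 @ U                              # coordinates of the samples
loadings = U                                 # row j = loading of feature j

# The loading of feature j is where the unit vector e_j lands in the
# principal-component plane; each can be drawn as an arrow from the origin.
for j, (a, b) in enumerate(loadings):
    print(f"feature f-{j}: loading = ({a:+.3f}, {b:+.3f})")
```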
This is illustrated in fig. 13, which shows a line along the direction of the loading corresponding to each feature, added to the scatter plot of the data in the plane spanned by the first two principal components. One observation one can make is that some of the features have loadings pointing more "left to right" in the plot, while others point more "top to bottom." So points that lie on the left side of the plot have smaller values of the "left to right" features, while those at the top of the plot have larger values of the "top to bottom" features.
Figure 13: Loadings in the Principal Component Plane
In the next, and last, section of this discussion of Principal Component Analysis, we will give proofs of the key mathematical ideas summarized earlier in table 2, which have been central to this analysis.
1.4.2 The singular value decomposition
The singular value decomposition is a slightly different way of looking at principal components. Let $\Lambda$ be the diagonal matrix of eigenvalues of the symmetric matrix $X_0^{T}X_0=(N-1)D$; we know that the entries of $\Lambda$ are non-negative, since $D$ is positive semi-definite and has the same eigenvectors. Let’s drop the eigenvectors corresponding to the eigenvalue zero. Let’s say that there are $r$ non-zero eigenvalues $\mu_1\ge\cdots\ge\mu_r>0$, and corresponding orthonormal eigenvectors $u_1,\ldots,u_r$.
Lemma: Let $P$ be the $k\times r$ matrix whose columns are the eigenvectors $u_1,\ldots,u_r$ with non-zero eigenvalues, and let $\Lambda_r$ be the $r\times r$ diagonal matrix whose entries are the non-zero eigenvalues $\mu_1,\ldots,\mu_r$. Then $X_0^{T}X_0P=P\Lambda_r$ and $P^{T}X_0^{T}X_0P=\Lambda_r$.
Proof: First observe that $X_0^{T}X_0P$ is in fact a $k\times r$ matrix. Then look at the column-by-column block structure, using $X_0^{T}X_0u_i=\mu_iu_i$, to verify the result.
The matrix $\Lambda_r$ is diagonal, invertible, and, since the eigenvalues $\mu_i$ are positive, it makes sense to consider the real matrix $\Lambda_r^{1/2}$ whose entries are the square roots of the eigenvalues.
Let $Q=X_0P\Lambda_r^{-1/2}$. Note that $Q$ is an $N\times r$ dimensional matrix.
Lemma: The columns of $Q$ are orthonormal.
Proof: Compute the matrix $Q^{T}Q$, whose entries are all of the dot products of the columns of $Q$: $$Q^{T}Q=\Lambda_r^{-1/2}P^{T}(X_0^{T}X_0P)\Lambda_r^{-1/2}=\Lambda_r^{-1/2}P^{T}P\Lambda_r\Lambda_r^{-1/2}=I,$$ by the previous lemma and the fact that $P^{T}P$ is the identity matrix.
Rearranging this yields the singular value decomposition.
Theorem: (The singular value decomposition) The matrix $X_0$ has a decomposition $$X_0=Q\Sigma P^{T},$$ where $Q$ (of dimension $N\times r$) and $P$ (of dimension $k\times r$) have orthonormal columns, and $\Sigma=\Lambda_r^{1/2}$ (of dimension $r\times r$) is diagonal with positive entries. Furthermore, the squares of the diagonal entries of $\Sigma$ are the non-zero eigenvalues of $X_0^{T}X_0$, and $\Sigma$ is uniquely determined by $X_0$ (while $P$ and $Q$ are determined up to choices of eigenvectors for repeated eigenvalues).
Proof: We won’t work through all of this, as it is a reinterpretation of our work on principal components.
Remark: The diagonal entries of $\Sigma$ are called the singular values of $X_0$. They can be found directly by considering an optimization problem equivalent to the one we solved in section 1.3.4.
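In practice one usually computes the SVD directly (for example with `np.linalg.svd`) rather than forming $X_0^{T}X_0$. The sketch below checks the relationship described above on random data.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(50, 4))
X0 = X - X.mean(axis=0)
N = X0.shape[0]

# Singular value decomposition of the centered data matrix.
Q, sing_vals, Pt = np.linalg.svd(X0, full_matrices=False)

# The squared singular values are the eigenvalues of X0^T X0, i.e.
# (N-1) times the eigenvalues of the covariance matrix D.
D = X0.T @ X0 / (N - 1)
evals = np.sort(np.linalg.eigvalsh(D))[::-1]

print(np.allclose(sing_vals**2 / (N - 1), evals))
print(np.allclose(Q * sing_vals @ Pt, X0))      # X0 = Q Sigma P^T
```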
1.5 Eigenvalues and Eigenvectors of Real Symmetric Matrices (The Spectral Theorem)
Now that we’ve shown how to apply the theory of eigenvalues and eigenvectors of symmetric matrices to extract principal directions from data, and to use those principal directions to find structure, we will give a proof of the properties that we summarized in table 2.
A key tool in the proof is the Gram-Schmidt orthogonalization process.
1.5.1 Gram-Schmidt
Proposition (Gram-Schmidt Process): Let $w_1,\ldots,w_s$ be a collection of linearly independent vectors in $\mathbb{R}^{N}$ and let $W$ be the span of the $w_i$. Let $v_1=w_1$ and let $$v_i=w_i-\sum_{j=1}^{i-1}\frac{w_i\cdot v_j}{v_j\cdot v_j}\,v_j$$ for $i=2,\ldots,s$. Then
The vectors $v_1,\ldots,v_s$ are orthogonal: $v_i\cdot v_j=0$ unless $i=j$.
The vectors $v_1,\ldots,v_s$ span $W$.
Each $v_i$ is orthogonal to all of $w_1,\ldots,w_{i-1}$.
The vectors $v_1/\|v_1\|,\ldots,v_s/\|v_s\|$ are orthonormal.
Proof: This is an inductive exercise, and we leave it to you to work out the details.
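A direct implementation of the process (classical Gram-Schmidt, returning the normalized vectors of the last claim) might look like the following sketch:

```python
import numpy as np

def gram_schmidt(W):
    """Orthonormalize the columns of W, assumed linearly independent."""
    V = []
    for w in W.T:                        # process one column at a time
        v = w.astype(float)
        for u in V:                      # subtract projections onto earlier vectors
            v = v - (w @ u) * u
        V.append(v / np.linalg.norm(v))  # normalize
    return np.column_stack(V)

W = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
Q = gram_schmidt(W)
print(np.allclose(Q.T @ Q, np.eye(3)))   # columns are orthonormal
```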
1.5.2 The spectral theorem
Theorem: Let $A$ be a $k\times k$ real symmetric matrix. Then:
1. All of the eigenvalues $\lambda$ of $A$ are real. If $x^{T}Ax\ge 0$ for all $x$, then all eigenvalues $\lambda\ge 0$.
2. The matrix $A$ is diagonalizable – that is, it has $k$ linearly independent eigenvectors.
3. If $u$ and $u'$ are eigenvectors corresponding to eigenvalues $\lambda$ and $\lambda'$, with $\lambda\neq\lambda'$, then $u$ and $u'$ are orthogonal: $u\cdot u'=0$.
4. There is an orthonormal basis $u_1,\ldots,u_k$ of $\mathbb{R}^{k}$ made up of eigenvectors for the eigenvalues $\lambda_1,\ldots,\lambda_k$.
5. Let $\Lambda$ be the diagonal matrix with diagonal entries $\lambda_1,\ldots,\lambda_k$ and let $P$ be the matrix whose columns are made up of the eigenvectors $u_1,\ldots,u_k$. Then $A=P\Lambda P^{T}$.
Proof: Start with property 1. Suppose that $\lambda$ is an eigenvalue of $A$. Let $u$ be a corresponding nonzero eigenvector. Then $Au=\lambda u$ and $A\bar{u}=\bar{\lambda}\bar{u}$, where $\bar{u}$ is the vector whose entries are the complex conjugates of the entries of $u$ (and $\bar{A}=A$ since $A$ is real). Now we have $$\bar{u}^{T}Au=\lambda\,\bar{u}^{T}u \quad\text{and}\quad u^{T}A\bar{u}=\bar{\lambda}\,u^{T}\bar{u}.$$ But the left-hand sides of both of these equations are the same (take the transpose and use the symmetry of $A$), so we must have $$\lambda\,\bar{u}^{T}u=\bar{\lambda}\,\bar{u}^{T}u,$$ and since $\bar{u}^{T}u=\sum_i|u_i|^{2}>0$, we get $\lambda=\bar{\lambda}$, meaning $\lambda$ is real.
If we have the additional property that $x^{T}Ax\ge 0$ for all $x$, then in particular $u^{T}Au=\lambda\,u^{T}u\ge 0$ for a (real) eigenvector $u$ with eigenvalue $\lambda$, and since $u^{T}u>0$ we must have $\lambda\ge 0$.
Property 2 is in some ways the most critical fact. We know from the general theory of the characteristic polynomial, and the fundamental theorem of algebra, that $A$ has $k$ complex eigenvalues, although some may be repeated. However, for a general matrix it may not be the case that there are $k$ linearly independent eigenvectors – the matrix may not be diagonalizable. So we will establish that now for symmetric matrices.
A one-by-one matrix is automatically symmetric and diagonalizable. In the $k$-dimensional case, we know, at least, that $A$ has at least one eigenvector, and a real one at that by part 1, and this gives us a place to begin an inductive argument.
Let $u_1$ be an eigenvector with eigenvalue $\lambda_1$, normalized so that $\|u_1\|=1$,
and extend this to a basis $u_1,w_2,\ldots,w_k$ of $\mathbb{R}^{k}$. Apply the Gram-Schmidt process to construct an orthonormal basis $e_1,\ldots,e_k$ of $\mathbb{R}^{k}$ so that $e_1=u_1$.
Any vector $v$ is a linear combination $v=\sum_{i=1}^{k}c_ie_i$ and, since the $e_i$ are orthonormal, the coefficients can be calculated as $c_i=v\cdot e_i$.
Using this, we can find the matrix $A'$ of the linear map defined by our original matrix $A$ in this new basis. By definition, if $a'_{ij}$ are the entries of $A'$, then $$Ae_j=\sum_{i=1}^{k}a'_{ij}e_i,$$
and so $$a'_{ij}=(Ae_j)\cdot e_i=e_i^{T}Ae_j.$$
Since $A$ is symmetric, $e_i^{T}Ae_j=e_j^{T}Ae_i$, and so $a'_{ij}=a'_{ji}$. In other words, the matrix $A'$ is still symmetric. Furthermore, $$a'_{i1}=e_i^{T}Ae_1=e_i^{T}(\lambda_1e_1)=\lambda_1(e_i\cdot e_1),$$
since $e_1=u_1$ is an eigenvector. Since the $e_i$ are an orthonormal basis, we see that $a'_{i1}=0$ unless $i=1$, and $a'_{11}=\lambda_1$.
In other words, the matrix $A'$ has a block form: $$A'=\begin{pmatrix}\lambda_1 & 0\\ 0 & A_1\end{pmatrix},$$ and the $(k-1)\times(k-1)$ block $A_1$ is symmetric. The inductive hypothesis tells us that the symmetric matrix $A_1$ is diagonalizable, so it has a basis of eigenvectors $u_2,\ldots,u_k$ with eigenvalues $\lambda_2,\ldots,\lambda_k$; this gives us a basis for the subspace of $\mathbb{R}^{k}$ spanned by $e_2,\ldots,e_k$ which, together with $u_1$, gives us a basis of $\mathbb{R}^{k}$ consisting of eigenvectors of $A$.
This finishes the proof of property 2.
For property 3, suppose $Au=\lambda u$ and $Au'=\lambda'u'$ with $\lambda\neq\lambda'$, and compute $$\lambda(u\cdot u')=(Au)^{T}u'=u^{T}Au'=\lambda'(u\cdot u').$$ Since $\lambda\neq\lambda'$, we must have $u\cdot u'=0$.
For property 4, if the eigenvalues are all distinct, this is a consequence of property 3 – you have $k$ eigenvectors, scaled to length $1$, for different eigenvalues, and by property 3 they are orthogonal. So the only complication is the case where some eigenvalues are repeated. If $\lambda$ occurs $m$ times, then by property 2 you have $m$ linearly independent eigenvectors that span the corresponding eigenspace. The Gram-Schmidt process allows you to construct an orthonormal set that spans this eigenspace, and while this orthonormal set isn’t unique, any one of them will do.
For property 5, let $\epsilon_i$ be the column vector that is zero except for a $1$ in position $i$. The product $\epsilon_i^{T}A\epsilon_j=a_{ij}$ picks out the $(i,j)$ entry of $A$. Let’s write $\epsilon_i$ and $\epsilon_j$ in terms of the orthonormal basis $u_1,\ldots,u_k$: $$\epsilon_i=\sum_{m=1}^{k}(\epsilon_i\cdot u_m)u_m, \qquad \epsilon_j=\sum_{l=1}^{k}(\epsilon_j\cdot u_l)u_l.$$ Using this expansion, we compute $\epsilon_i^{T}A\epsilon_j$ in a more complicated way: $$\epsilon_i^{T}A\epsilon_j=\sum_{m=1}^{k}\sum_{l=1}^{k}(\epsilon_i\cdot u_m)(\epsilon_j\cdot u_l)\,u_m^{T}Au_l.$$ But $u_m^{T}Au_l=\lambda_l\,u_m^{T}u_l=0$ unless $m=l$, in which case it equals $\lambda_m$, so $$a_{ij}=\sum_{m=1}^{k}\lambda_m(\epsilon_i\cdot u_m)(\epsilon_j\cdot u_m).$$ On the other hand, $\epsilon_i\cdot u_m$ is the $i^{th}$ entry of $u_m$, which is the $(i,m)$ entry of $P$, and $\epsilon_j\cdot u_m$ is the $(j,m)$ entry of $P$. Therefore the $(i,j)$ entry of $P\Lambda P^{T}$ is $$\sum_{m=1}^{k}P_{im}\lambda_mP_{jm}=a_{ij},$$ so the two matrices $A$ and $P\Lambda P^{T}$ are in fact equal.
Exercises:
Prove the rest of the first lemma in section 1.4.2.
Prove the Gram-Schmidt Process has the claimed properties in section 1.5.1.