Introduction
Suppose that we are given a collection of data made up of samples from two different classes, and we would like to develop an algorithm that can distinguish between the two classes. For example, given a collection of pictures, each of which shows either a dog or a cat, we’d like to be able to say which of the pictures are dogs and which are cats. For another example, we might want to be able to distinguish “real” emails from “spam.” This type of problem is called a classification problem.
Typically, one approaches a classification problem by beginning with a large set of data for which the classes are already known, and using that data to train an algorithm to correctly distinguish the classes on those known examples. For example, you start with a few thousand pictures labelled “dog” and “cat” and you build your algorithm so that it does a good job distinguishing the dogs from the cats in this initial set of training data. Then you apply your algorithm to pictures that aren’t labelled and rely on the predictions you get, hoping that whatever allowed your algorithm to distinguish between the training examples will generalize so that it correctly classifies images that aren’t pre-labelled.
Because classification is such a central problem, there are many approaches to it. We will see several of them through the course of these lectures. We will begin with a particular classification algorithm called “Support Vector Machines” (SVM) that is based on linear algebra. The SVM algorithm is widely used in practice and has a beautiful geometric interpretation, so it will serve as a good beginning for later discussion of more complicated classification algorithms.
Incidentally, I’m not sure why this algorithm is called a “machine”; the algorithm was introduced in the paper [1] where it is called the “Optimal Margin Classifier” and as we shall see that is a much better name for it.
My presentation of this material was heavily influenced by the beautiful paper [2].
A simple example
Let us begin our discussion with a very simple dataset (see [3] and [4]). This data consists of various measurements of physical characteristics of 344 penguins of 3 different species: Gentoo, Adelie, and Chinstrap. If we focus our attention for the moment on the Adelie and Gentoo species, and plot their body mass against their culmen depth, we obtain the following scatterplot.
Incidentally, a bird’s culmen is the upper ridge of its beak, and the culmen depth is a measure of the thickness of the beak. There’s a nice picture at [4] for the penguin enthusiasts.
A striking feature of this scatter plot is that there is a clear separation between the clusters of Adelie and Gentoo penguins. Adelie penguins have deeper culmens and less body mass than Gentoo penguins. It seems that these two measurements alone should give us a way to decide to which of the two species a given penguin belongs.
One way to express the separation between these two clusters is to observe that one can draw a line on the graph with the property that all of the Adelie penguins lie on one side of that line and all of the Gentoo penguins lie on the other. In Figure 7.2 I’ve drawn in such a line (which I found by eyeballing the picture in Figure 7.1). Writing $d$ for the culmen depth in mm and $m$ for the body mass in grams, the line has an equation of the form
$$m = a d + c$$
for the particular slope $a$ and intercept $c$ shown in the figure. (Note that the $y$-axis of the graph is scaled in units of 200 grams.)

The fact that all of the Gentoo penguins lie above this line means that, for the Gentoo penguins,
$$m > a d + c,$$

while for the Adelie penguins
$$m < a d + c.$$

Now, if we measure a penguin caught in the wild, we can compute $a d + c$ for that penguin, and if this number is greater than the penguin’s mass $m$, we say it’s an Adelie; otherwise, a Gentoo. Based on the experimental data we’ve collected – the training data – this seems likely to work pretty well.
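In code, this decision rule is just a comparison against a linear function of the culmen depth. Here is a minimal sketch; the slope `a` and intercept `c` are placeholders standing in for the eyeballed line in Figure 7.2, not fitted values.

```python
# A minimal sketch of the eyeballed rule from Figure 7.2. The slope `a` and
# intercept `c` are placeholders, not the actual values read off the figure.
def classify_penguin(culmen_depth_mm, body_mass_g, a, c):
    """Predict the species using the line m = a*d + c: points above it are Gentoo."""
    line_mass = a * culmen_depth_mm + c
    return "Adelie" if body_mass_g < line_mass else "Gentoo"
```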
The general case
To generalize this approach, let’s imagine now that we have $n$ samples and $k$ features (or measurements) for each sample. As before, we can represent this data as an $n\times k$ data matrix $X$. In the penguin example, our data matrix would be $n\times 2$, with one row for each penguin and the two columns representing the body mass and the culmen depth. In addition to this numerical data, we have a classification that assigns each sample to one of two classes. Let’s represent the classes by a vector $y = (y_1,\ldots,y_n)$, where $y_i = +1$ if the $i$th sample is in one class, and $y_i = -1$ if that sample is in the other. Our goal is to predict $y$ based on $X$ – but unlike in linear regression, $y$ takes on only the values $\pm 1$.
In the penguin case, we were able to find a line that separated the two classes and then classify points by which side of the line the point was on. We can generalize this notion to higher dimensions. Before attacking that generalization, let’s recall a few facts about the generalization to $\mathbb{R}^k$ of the idea of a line.
Hyperplanes
The correct generalization to $\mathbb{R}^k$ of a line in $\mathbb{R}^2$ given by an equation $f(x,y) = 0$ is a hyperplane given by an equation $f(x_1,\ldots,x_k) = 0$, where $f$ is a degree one polynomial
$$f(x_1,\ldots,x_k) = w_1 x_1 + w_2 x_2 + \cdots + w_k x_k + b. \tag{7.1}$$

It’s easier to understand the geometry of an equation like the one in Equation 7.1 if we think of the coefficients $w_1,\ldots,w_k$ as forming a nonzero vector $w$ in $\mathbb{R}^k$ and writing the formula for $f$ as $f(x) = w\cdot x + b$.
Lemma: Let $f(x) = w\cdot x + b$ with $w$ a nonzero vector in $\mathbb{R}^k$ and $b$ a constant in $\mathbb{R}$.
- The inequalities $f(x) > 0$ and $f(x) < 0$ divide up $\mathbb{R}^k$ into two disjoint subsets (called half spaces), in the same way that a line in $\mathbb{R}^2$ divides the plane in half.
- The vector $w$ is a normal vector to the hyperplane $f(x) = 0$. Concretely this means that if $p$ and $q$ are any two points in that hyperplane, then $w\cdot(p - q) = 0$.
- Let $p$ be a point in $\mathbb{R}^k$. Then the perpendicular distance $D$ from $p$ to the hyperplane $f(x) = 0$ is
$$D = \frac{|f(p)|}{\|w\|} = \frac{|w\cdot p + b|}{\|w\|}.$$
Proof: The first part is clear since the inequalities are mutually exclusive. For the second part, suppose that $p$ and $q$ satisfy $f(p) = f(q) = 0$. Then $w\cdot p + b = 0$ and $w\cdot q + b = 0$. Subtracting these two equations gives $w\cdot(p - q) = 0$, so $w$ is orthogonal to $p - q$.

For the third part, consider Figure 7.3. The point $q$ is an arbitrary point on the hyperplane defined by the equation $f(x) = 0$. The distance from the hyperplane to $p$ is measured along the dotted line perpendicular to the hyperplane. The dot product $w\cdot(p - q) = \|w\|\,\|p - q\|\cos\theta$, where $\theta$ is the angle between $w$ and $p - q$ – which is complementary to the angle between the segment from $q$ to $p$ and the hyperplane. The distance is therefore
$$D = \|p - q\|\,|\cos\theta| = \frac{|w\cdot(p - q)|}{\|w\|}.$$
However, since $q$ lies on the hyperplane, we know that $w\cdot q + b = 0$, so $w\cdot q = -b$. Therefore
$$D = \frac{|w\cdot p + b|}{\|w\|},$$
which is the formula we seek.
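As a quick numerical check of the Lemma, here is a small sketch of the distance formula (the function name is just for illustration):

```python
import numpy as np

# Perpendicular distance from a point p to the hyperplane w.x + b = 0,
# following the Lemma: |w.p + b| / ||w||.
def distance_to_hyperplane(p, w, b):
    p, w = np.asarray(p, dtype=float), np.asarray(w, dtype=float)
    return abs(np.dot(w, p) + b) / np.linalg.norm(w)

# The distance from the origin to the line x + y - 1 = 0 should be 1/sqrt(2).
print(distance_to_hyperplane([0.0, 0.0], [1.0, 1.0], -1.0))  # approx 0.7071
```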
Linear separability and Margins
Now we can return to our classification scheme. The following definition generalizes our two dimensional picture from the penguin data.
Definition: Suppose that we have an $n\times k$ data matrix $X$ and a set of labels $y_i\in\{+1,-1\}$ that assign the $n$ samples to one of two classes. Then the labelled data is said to be linearly separable if there is a vector $w$ and a constant $b$ so that, if $f(x) = w\cdot x + b$, then $f(x) > 0$ whenever $x$ is a row of $X$ – a sample – belonging to the $+1$ class, and $f(x) < 0$ whenever $x$ belongs to the $-1$ class. The solutions to the equation $f(x) = 0$ in this situation form a hyperplane that is called a separating hyperplane for the data.

In the situation where our data falls into two classes that are linearly separable, our classification strategy is to find a separating hyperplane $f(x) = 0$ for our training data. Then, given a point $x$ whose class we don’t know, we can evaluate $f(x)$ and assign $x$ to a class depending on whether $f(x) > 0$ or $f(x) < 0$.
This definition raises two questions about a particular dataset:
- How do we tell if the two classes are linearly separable?
- If the two sets are linearly separable, there are infinitely many separating hyperplanes. To see this, look back at the penguin example and notice that we can ‘wiggle’ the red line a little bit and it will still separate the two sets. Which is the ‘best’ separating hyperplane?
Let’s try to make the first of these two questions concrete. We have two sets of points $A$ and $B$ in $\mathbb{R}^k$, and we want to (try to) find a vector $w$ and a constant $b$ so that $f(x) = w\cdot x + b$ takes strictly positive values for $x\in A$ and strictly negative ones for $x\in B$. Let’s approach the problem by first choosing $w$ and then asking whether there is a $b$ that will work. In the two dimensional case, this is equivalent to choosing the slope of our line, and then asking if we can find an intercept so that the line passes between the two classes.

In algebraic terms, we are trying to solve the following system of inequalities: given $w$, find $b$ so that:
$$w\cdot x + b > 0 \text{ for all } x\in A$$
and
$$w\cdot x + b < 0 \text{ for all } x\in B.$$
This is only going to be possible if there is a gap between the smallest value of $w\cdot x$ for $x\in A$ and the largest value of $w\cdot x$ for $x\in B$. In other words, given $w$ there is a $b$ so that $w\cdot x + b = 0$ separates $A$ and $B$ if
$$\max_{x\in B} w\cdot x < \min_{x\in A} w\cdot x.$$
If this holds, then choose $b$ so that $-b$ lies in this open interval and you will obtain a separating hyperplane.
Proposition: The sets $A$ and $B$ are linearly separable if there is a $w$ so that
$$\max_{x\in B} w\cdot x < \min_{x\in A} w\cdot x.$$
If this inequality holds for some $w$, and $-b$ lies within this open interval, then $w\cdot x + b = 0$ is a separating hyperplane for $A$ and $B$.
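The Proposition translates directly into a small computation: for a fixed $w$, compare the largest value of $w\cdot x$ over $B$ with the smallest over $A$, and pick $-b$ in the gap if there is one. Here is a hedged sketch (the midpoint choice of $b$ is just one convenient option):

```python
import numpy as np

# Given a fixed direction w, return an offset b such that w.x + b = 0
# separates the rows of A (positive side) from the rows of B (negative side),
# or None if no such offset exists for this w.
def separating_offset(A, B, w):
    lo = np.max(B @ w)           # largest value of w.x over B
    hi = np.min(A @ w)           # smallest value of w.x over A
    if lo < hi:
        return -0.5 * (lo + hi)  # -b is the midpoint of the open interval (lo, hi)
    return None
```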
Figure 7.4 is an illustration of this argument for a subset of the penguin data. Here, we have fixed $w$ coming from the line that we eyeballed earlier. For each Gentoo (green) point $x$, we computed $c = w\cdot x$ and drew the line $w\cdot z = c$, giving a family of parallel lines through each of the green points. Similarly, for each Adelie (blue) point we drew the corresponding line. The maximum value of $w\cdot x$ over the blue points turned out to be smaller than the minimum value of $w\cdot x$ over the green points. Thus we have two boundary lines with a gap between them, and any parallel line in that gap will separate the two sets.

Finally, among all the lines with this particular $w$, it seems that the best separating line is the one running right down the middle of the gap between the boundary lines. Any other line in the gap will be closer to either the blue or the green set than the midpoint line is.
Let’s put all of this together and see if we can make sense of it in general.
Suppose that $A$ and $B$ are finite point sets in $\mathbb{R}^k$ and $w$ is a vector such that
$$\max_{x\in B} w\cdot x < \min_{x\in A} w\cdot x.$$
Let $x^{+}$ be a point in $A$ with $w\cdot x^{+} = \min_{x\in A} w\cdot x$ and $x^{-}$ be a point in $B$ with $w\cdot x^{-} = \max_{x\in B} w\cdot x$. The two hyperplanes
$$H^{+}: w\cdot x = w\cdot x^{+}\qquad\text{and}\qquad H^{-}: w\cdot x = w\cdot x^{-}$$
have the property that:
$$w\cdot x \ge w\cdot x^{+} \text{ for all } x\in A$$
and
$$w\cdot x \le w\cdot x^{-} \text{ for all } x\in B.$$

Hyperplanes like $H^{+}$ and $H^{-}$, which “just touch” a set of points, are called supporting hyperplanes.
Definition: Let $S$ be a set of points in $\mathbb{R}^k$. A hyperplane $w\cdot x + b = 0$ is called a supporting hyperplane for $S$ if $w\cdot x + b\ge 0$ for all $x\in S$ and $w\cdot x + b = 0$ for at least one point $x$ in $S$, or if $w\cdot x + b\le 0$ for all $x\in S$ and $w\cdot x + b = 0$ for at least one point $x$ in $S$.
The gap between the two supporting hyperplanes $H^{+}$ and $H^{-}$ is called the margin between $A$ and $B$ for $w$.

Definition: Let $H^{+}$ and $H^{-}$ be as in the discussion above for point sets $A$ and $B$ and vector $w$. Then the orthogonal distance between the two hyperplanes $H^{+}$ and $H^{-}$ is called the geometric margin $\tau_{w}(A,B)$ (along $w$) between $A$ and $B$. We have
$$\tau_{w}(A,B) = \frac{\min_{x\in A} w\cdot x - \max_{x\in B} w\cdot x}{\|w\|}.$$
Now we can propose an answer to our second question about the best classifying hyperplane.
Definition: The optimal margin $\tau(A,B)$ between $A$ and $B$ is the largest value of $\tau_{w}(A,B)$ over all possible $w$ for which $\max_{x\in B} w\cdot x < \min_{x\in A} w\cdot x$:
$$\tau(A,B) = \max_{w}\,\tau_{w}(A,B).$$
If $w$ is such that $\tau_{w}(A,B) = \tau(A,B)$, then the hyperplane
$$w\cdot x = \tfrac{1}{2}\Big(\min_{x\in A} w\cdot x + \max_{x\in B} w\cdot x\Big)$$
is the optimal margin classifying hyperplane.

The optimal classifying hyperplane runs “down the middle” of the gap between the two supporting hyperplanes $H^{+}$ and $H^{-}$ that give the sides of the optimal margin.
We can make one more observation about the maximal margin. If we find a vector $w$ so that $H^{+}$ and $H^{-}$ are the two supporting hyperplanes such that the gap between them is the optimal margin, then this gap gives us an estimate of how close together the points in $A$ and $B$ can be. This is visible in Figure 7.4, where it’s clear that to get from a blue point to a green one, you have to cross the gap between the two supporting hyperplanes.
Proposition: The closest distance between points in $A$ and $B$ is greater than or equal to the optimal margin:
$$\min_{a\in A,\,b\in B}\|a - b\| \ \ge\ \tau(A,B).$$

Proof: For any $a\in A$ and $b\in B$ we have $w\cdot a \ge \min_{x\in A} w\cdot x$ and $w\cdot b \le \max_{x\in B} w\cdot x$. These two inequalities imply that
$$w\cdot(a - b) \ \ge\ \min_{x\in A} w\cdot x - \max_{x\in B} w\cdot x.$$
Therefore
$$\|a - b\|\,\|w\| \ \ge\ w\cdot(a - b) \ \ge\ \min_{x\in A} w\cdot x - \max_{x\in B} w\cdot x,$$
and so
$$\|a - b\| \ \ge\ \tau_{w}(A,B)$$
for every admissible $w$; taking the maximum over $w$ gives the result.
If this inequality were always an equality – that is, if the optimal margin equalled the minimum distance between points in the two clusters – then this would give us an approach to finding the optimal margin.
Unfortunately, that isn’t the case. In Figure 7.5, we show a very simple case involving only six points in total in which the distance between the closest points in $A$ and $B$ is larger than the optimal margin.
At least now our problem is clear. Given our two point sets $A$ and $B$, find $w$ so that $\tau_{w}(A,B)$ is maximal among all $w$ with $\max_{x\in B} w\cdot x < \min_{x\in A} w\cdot x$. This is an optimization problem, but unlike the optimization problems that arose in our discussions of linear regression and principal component analysis, it does not have a closed form solution. We will need to find an algorithm to determine $w$ by successive approximations. Developing that algorithm will require thinking about a new concept known as convexity.
Convexity, Convex Hulls, and Margins
In this section we introduce the notion of a convex set and the particular case of the convex hull of a finite set of points. As we will see, these ideas will give us a different interpretation of the margin between two sets and will eventually lead to an algorithm for finding the optimal margin classifier.
Definition: A subset $U$ of $\mathbb{R}^k$ is convex if, for any pair of points $p$ and $q$ in $U$, every point on the line segment joining $p$ and $q$ also belongs to $U$. In vector form, for every $0\le t\le 1$, the point $tp + (1-t)q$ belongs to $U$. (Note that at $t = 1$ this point is $p$, at $t = 0$ it is $q$, and so $tp + (1-t)q$ traces out the segment joining $q$ to $p$.)
Figure 7.6 illustrates the difference between convex sets and non-convex ones.
The key idea from convexity that we will need to solve our optimization problem and find the optimal margin is the idea of the convex hull of a finite set of points in $\mathbb{R}^k$.
Definition: Let $S = \{x_1,\ldots,x_N\}$ be a finite set of points in $\mathbb{R}^k$. The convex hull $C(S)$ of $S$ is the set of points
$$\sum_{i=1}^{N}\lambda_i x_i$$
as $(\lambda_1,\ldots,\lambda_N)$ runs over all nonnegative real numbers such that
$$\sum_{i=1}^{N}\lambda_i = 1.$$
There are a variety of ways to think about the convex hull of a set of points $S$, but perhaps the most useful is that it is the smallest convex set that contains all of the points of $S$. That is the content of the next lemma.

Lemma: $C(S)$ is convex. Furthermore, let $U$ be any convex set containing all of the points of $S$. Then $U$ contains $C(S)$.
Proof: To show that $C(S)$ is convex, we apply the definition. Let $p$ and $q$ be two points in $C(S)$, so that
$$p = \sum_{i=1}^{N}\lambda_i x_i \qquad\text{and}\qquad q = \sum_{i=1}^{N}\gamma_i x_i,$$
where $\lambda_i,\gamma_i\ge 0$ and $\sum_i\lambda_i = \sum_i\gamma_i = 1$. Let $s(t) = tp + (1-t)q$ for $0\le t\le 1$. Then a little algebra shows that
$$s(t) = \sum_{i=1}^{N}\big(t\lambda_i + (1-t)\gamma_i\big)x_i,$$
where $t\lambda_i + (1-t)\gamma_i\ge 0$ and $\sum_i\big(t\lambda_i + (1-t)\gamma_i\big) = 1$. Therefore all of the points $s(t)$ belong to $C(S)$, and therefore $C(S)$ is convex.

For the second part, we proceed by induction. Let $U$ be a convex set containing $S$. Then by the definition of convexity, $U$ contains all sums $tx_i + (1-t)x_j$ with $0\le t\le 1$ – that is, all points of $C(S)$ with at most two nonzero coefficients. Now suppose that $U$ contains all the sums $\sum_i\lambda_i x_i$ where at most $r$ of the $\lambda_i$ are non-zero, for some $r\ge 2$.

Consider a sum $\sum_i\lambda_i x_i$ with exactly $r+1$ of the $\lambda_i$ nonzero. For simplicity let’s assume that $\lambda_i\neq 0$ for $i = 1,\ldots,r+1$. Now let $s = \sum_{i=1}^{r}\lambda_i$ and set
$$q = \sum_{i=1}^{r}\frac{\lambda_i}{s}\,x_i.$$
This point belongs to $U$ by the inductive hypothesis. Also, $s + \lambda_{r+1} = 1$. Therefore by convexity of $U$, the point
$$\sum_{i=1}^{r+1}\lambda_i x_i = s\,q + (1-s)\,x_{r+1}$$
also belongs to $U$. It follows that all of $C(S)$ belongs to $U$.
In Figure 7.7 we show our penguin data together with the convex hull of points corresponding to the two types of penguins. Notice that the boundary of each convex hull is a finite collection of line segments that join the “outermost” points in the point set.
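Hulls like the ones in Figure 7.7 can be computed numerically. The sketch below uses `scipy.spatial.ConvexHull`; the two arrays are synthetic stand-ins for the penguin measurements, not the actual dataset.

```python
import numpy as np
from scipy.spatial import ConvexHull

# Synthetic stand-ins for (culmen depth, body mass) measurements of the two
# species -- placeholders, not the real penguin data.
rng = np.random.default_rng(0)
adelie = rng.normal([18.3, 3700.0], [1.2, 450.0], size=(150, 2))
gentoo = rng.normal([15.0, 5100.0], [1.0, 500.0], size=(120, 2))

for points in (adelie, gentoo):
    hull = ConvexHull(points)
    # hull.vertices lists the indices of the "outermost" points in order, so
    # consecutive vertices are the endpoints of the boundary segments.
    print(points[hull.vertices])
```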
One very simple example of a convex set is a half-plane. More specifically, if $w\cdot x + b = 0$ is a hyperplane, then the two “sides” of the hyperplane, meaning the subsets where $w\cdot x + b\ge 0$ and where $w\cdot x + b\le 0$, are both convex. (This is Exercise 1 in Section 7.7.)

As a result of this observation, and the Lemma above, we can conclude that if $w\cdot x + b = 0$ is a supporting hyperplane for the finite set $S$ – meaning that either $w\cdot x + b\ge 0$ for all $x\in S$, or $w\cdot x + b\le 0$ for all $x\in S$, with at least one point of $S$ on the hyperplane – then it is a supporting hyperplane for the entire convex hull. After all, if $w\cdot x + b\ge 0$ for all points $x\in S$, then $S$ is contained in the convex set of points where $w\cdot x + b\ge 0$, and therefore $C(S)$ is contained in that set as well.
Interestingly, however, the converse is true as well – the supporting hyperplanes of $C(S)$ are exactly the same as those for $S$.
Lemma: Let $S$ be a finite set of points in $\mathbb{R}^k$ and let $H$ be a supporting hyperplane for $C(S)$. Then $H$ is a supporting hyperplane for $S$.
Proof: Suppose $H$, given by the equation $w\cdot x + b = 0$, is a supporting hyperplane for $C(S)$. Let’s assume that $w\cdot x + b\ge 0$ for all $x\in C(S)$ and $w\cdot x_0 + b = 0$ for some point $x_0\in C(S)$, since the case where $w\cdot x + b\le 0$ is identical. Since $S\subset C(S)$, we have $w\cdot x + b\ge 0$ for all $x\in S$. To show that $H$ is a supporting hyperplane for $S$, we need to know that $w\cdot x + b = 0$ for at least one point $x\in S$.

Let $s_0$ be the point in $S$ where $w\cdot x + b$ is minimal among all $x\in S$. Note that $w\cdot s_0 + b\ge 0$. Then the hyperplane $w\cdot x = w\cdot s_0$ has the property that $w\cdot x - w\cdot s_0\ge 0$ on all of $S$, and $s_0$ lies on it. Since the halfplane $w\cdot x - w\cdot s_0\ge 0$ is convex and contains all of $S$, we have $C(S)$ contained in that halfplane. So, on the one hand we have $w\cdot x_0 - w\cdot s_0\ge 0$. On the other hand $w\cdot x_0 + b = 0$, so $w\cdot x_0 = -b$, so $w\cdot s_0 + b\le 0$. Since $w\cdot s_0 + b$ is also greater than or equal to zero, we have $w\cdot s_0 + b = 0$, and so we have found a point of $S$ on the hyperplane $H$. Therefore $H$ is also a supporting hyperplane for $S$.
This argument can be used to give an alternative description of $C(S)$ as the intersection of all halfplanes $w\cdot x + b\ge 0$ arising from supporting hyperplanes $w\cdot x + b = 0$ for $C(S)$. This is Exercise 2 in Section 7.7. It also has as a corollary that $C(S)$ is a closed set.

Lemma: $C(S)$ is compact.

Proof: Exercise 2 in Section 7.7 shows that it is the intersection of closed sets in $\mathbb{R}^k$, so it is closed. Exercise 3 shows that $C(S)$ is bounded. Thus it is compact.
Now let’s go back to our optimal margin problem, so that we have linearly separable sets of points $A$ and $B$. Recall that we showed that the optimal margin was at most the minimal distance between points in $A$ and $B$, but that there could be a gap between the minimal distance and the optimal margin – see Figure 7.5 for a reminder.

It turns out that by considering the minimal distance between the convex hulls $C(A)$ and $C(B)$, we can “close this gap.” The following proposition, which generalizes the Proposition at the end of Section 7.3.2, shows that we can change the problem of finding the optimal margin into the problem of finding the closest distance between the convex hulls of $A$ and $B$.
Proposition: Let $A$ and $B$ be linearly separable sets in $\mathbb{R}^k$. Let $p\in C(A)$ and $q\in C(B)$ be any two points. Then
$$\|p - q\| \ \ge\ \tau(A,B).$$

Proof: As in the earlier proof, for an admissible $w$ choose supporting hyperplanes $H^{+}$ and $H^{-}$ for $A$ and $B$. By our discussion above, these are also supporting hyperplanes for $C(A)$ and $C(B)$. Therefore if $p\in C(A)$ and $q\in C(B)$, we have $w\cdot p\ge \min_{x\in A} w\cdot x$ and $w\cdot q\le \max_{x\in B} w\cdot x$. As before,
$$\|p - q\|\,\|w\| \ \ge\ w\cdot(p - q) \ \ge\ \min_{x\in A} w\cdot x - \max_{x\in B} w\cdot x,$$
and so
$$\|p - q\| \ \ge\ \tau_{w}(A,B).$$
Since this holds for any admissible $w$, we have the result for $\tau(A,B)$.
The reason this result is useful is that, as we’ve seen, if we restrict $p$ and $q$ to $A$ and $B$, then there can be a gap between the minimal distance and the optimal margin. If we allow $p$ and $q$ to range over the convex hulls of these sets, then that gap disappears.

One other consequence of this is that if $A$ and $B$ are linearly separable then their convex hulls are disjoint.
Corollary: If $A$ and $B$ are linearly separable then $\|p - q\| > 0$ for all $p\in C(A)$ and $q\in C(B)$.

Proof: The sets are linearly separable precisely when $\tau(A,B) > 0$, and by the Proposition $\|p - q\|\ge\tau(A,B)$.
Our strategy now is to show that if $p$ and $q$ are points in $C(A)$ and $C(B)$ respectively that are at minimal distance, and if we set $w = p - q$, then we obtain supporting hyperplanes with margin equal to $\|p - q\|$. Since this is the largest possible margin, this $w$ must be optimal. This transforms the problem of finding the optimal margin into the problem of finding the closest points in the convex hulls.
Lemma: Let
$$d = \min_{p\in C(A),\,q\in C(B)}\|p - q\|.$$
Then there are points $p\in C(A)$ and $q\in C(B)$ with $\|p - q\| = d$. If $(p,q)$ and $(p',q')$ are two pairs of points satisfying this condition, then $p - q = p' - q'$.
Proof: Consider the set of differences
$$C(A) - C(B) = \{\,p - q : p\in C(A),\ q\in C(B)\,\}.$$

Since $\|v\|$ is a continuous function of $v$ on this compact set, it attains its minimum, and so there is a $v_0 = p - q$ in $C(A) - C(B)$ with $\|v_0\| = d$.
Now suppose that there are two distinct points $v_0 = p - q$ and $v_1 = p' - q'$ in $C(A) - C(B)$ with $\|v_0\| = \|v_1\| = d$. Consider the line segment
$$s(t) = v_0 + t(v_1 - v_0),\qquad 0\le t\le 1,$$
joining $v_0$ and $v_1$.

Now
$$s(t) = \big(p + t(p' - p)\big) - \big(q + t(q' - q)\big).$$
Both terms in this difference belong to $C(A)$ and $C(B)$ respectively, regardless of $t$, by convexity, and therefore $s(t)$ belongs to $C(A) - C(B)$ for all $0\le t\le 1$.
This little argument shows that $C(A) - C(B)$ is convex. In geometric terms, $v_0$ and $v_1$ are two points in this set equidistant from the origin, and the segment joining them is a chord of a circle; as Figure 7.8 shows, in that situation there must be a point on the line segment joining them that’s closer to the origin than they are. Since all the points on that segment are in $C(A) - C(B)$ by convexity, this would contradict the assumption that $v_0$ is the closest point in $C(A) - C(B)$ to the origin.

In algebraic terms, since $d^2$ is the minimal value of $\|v\|^2$ for $v$ in $C(A) - C(B)$, we must have
$$\frac{d}{dt}\|s(t)\|^2\Big|_{t=0} \ \ge\ 0.$$

On the other hand
$$\frac{d}{dt}\|s(t)\|^2\Big|_{t=0} = 2\,v_0\cdot(v_1 - v_0) = 2\big(v_0\cdot v_1 - \|v_0\|^2\big).$$
If $v_0\cdot v_1 < \|v_0\|^2$, then this derivative would be negative, which would mean that there is a value of $t$ where $\|s(t)\|$ would be less than $d$. Since that can’t happen, we conclude that $v_0\cdot v_1 \ge \|v_0\|^2 = \|v_0\|\,\|v_1\|$, using $\|v_0\| = \|v_1\| = d$. The Cauchy–Schwarz inequality gives the reverse inequality, so $v_0\cdot v_1 = \|v_0\|\,\|v_1\|$, which means that $v_0 = v_1$ – the vectors have the same magnitude and are parallel. This establishes uniqueness.
Note: The essential ideas of this argument show that a compact convex set in $\mathbb{R}^k$ has a unique point closest to the origin. The convex set in this instance is called the difference $C(A) - C(B)$, and it is generally true that the difference of convex sets is convex.
Now we can conclude this line of argument.
Theorem: Let $p$ and $q$ be points in $C(A)$ and $C(B)$ respectively such that $\|p - q\|$ is minimal among all such pairs. Let $w = p - q$ and set $c^{+} = w\cdot p$ and $c^{-} = w\cdot q$. Then the hyperplanes $H^{+}: w\cdot x = c^{+}$ and $H^{-}: w\cdot x = c^{-}$ are supporting hyperplanes for $C(A)$ and $C(B)$ respectively and the associated margin is optimal.
Proof: First we show that $H^{+}$ is a supporting hyperplane for $C(A)$, with $w\cdot x\ge c^{+}$ on all of $C(A)$. Suppose not. Then there is a point $p'\in C(A)$ such that $w\cdot p' < w\cdot p$. Consider the line segment $s(t) = p + t(p' - p)$, $0\le t\le 1$, running from $p$ to $p'$. By convexity it is entirely contained in $C(A)$. Now look at the distance from points on this segment to $q$: We have
$$\frac{d}{dt}\|s(t) - q\|^2\Big|_{t=0} = 2(p - q)\cdot(p' - p) = 2\,w\cdot(p' - p) < 0,$$
since $w\cdot p' < w\cdot p$. This means that $\|s(t) - q\|$ is decreasing near $t = 0$, and so there is a point $s(t)\in C(A)$ where $\|s(t) - q\| < \|p - q\|$. This contradicts the fact that $\|p - q\|$ is the minimal distance. The same argument shows that $H^{-}$ is also a supporting hyperplane.

Now $H^{+}$ and $H^{-}$ are supporting hyperplanes for $A$ and $B$ as well, so $\min_{x\in A} w\cdot x = c^{+}$ and $\max_{x\in B} w\cdot x = c^{-}$, and the margin for this $w$ is
$$\tau_{w}(A,B) = \frac{c^{+} - c^{-}}{\|w\|} = \frac{w\cdot(p - q)}{\|p - q\|} = \|p - q\|.$$
By the Proposition above, $\|p - q\|$ is the largest value any margin can have as $w$ varies. Thus this is the maximal margin.
Figure 7.9 shows how considering the closest points in the convex hulls “fixes” the problem that we saw in Figure 7.5. The closest point occurs at a point on the boundary of the convex hull that is not one of the points in $A$ or $B$.
Finding the Optimal Margin Classifier
Now that we have translated our problem into geometry, we can attempt to develop an algorithm for solving it. To recap, we have two sets of points $A = \{a_1,\ldots,a_{|A|}\}$ and $B = \{b_1,\ldots,b_{|B|}\}$ in $\mathbb{R}^k$ that are linearly separable.
We wish to find points $p\in C(A)$ and $q\in C(B)$ such that
$$\|p - q\| = \min_{p'\in C(A),\,q'\in C(B)}\|p' - q'\|.$$

Using the definition of the convex hull we can express this more concretely. Since $p\in C(A)$ and $q\in C(B)$, there are coefficients $\lambda_i\ge 0$ for $i = 1,\ldots,|A|$ and $\mu_j\ge 0$ for $j = 1,\ldots,|B|$ so that
$$p = \sum_{i=1}^{|A|}\lambda_i a_i \qquad\text{and}\qquad q = \sum_{j=1}^{|B|}\mu_j b_j,$$
where $\sum_i\lambda_i = \sum_j\mu_j = 1$.
We can summarize this as follows:
Optimization Problem 1: Write
$$p = \sum_{i=1}^{|A|}\lambda_i a_i\qquad\text{and}\qquad q = \sum_{j=1}^{|B|}\mu_j b_j.$$
Define
$$Q(\lambda,\mu) = \|p - q\|^2 = \Big\|\sum_{i}\lambda_i a_i - \sum_{j}\mu_j b_j\Big\|^2.$$
To find the supporting hyperplanes that define the optimal margin between $A$ and $B$, find $\lambda$ and $\mu$ such that $Q(\lambda,\mu)$ is minimal among all $\lambda,\mu$ where all $\lambda_i\ge 0$, all $\mu_j\ge 0$, and $\sum_i\lambda_i = \sum_j\mu_j = 1$.
This is an example of a constrained optimization problem. It’s worth observing that the objective function is just a quadratic function in the $\lambda_i$ and $\mu_j$. Indeed we can expand $Q$ to obtain
$$Q(\lambda,\mu) = \sum_{i,i'}\lambda_i\lambda_{i'}\,(a_i\cdot a_{i'}) - 2\sum_{i,j}\lambda_i\mu_j\,(a_i\cdot b_j) + \sum_{j,j'}\mu_j\mu_{j'}\,(b_j\cdot b_{j'}), \tag{7.2}$$
where the coefficients of this quadratic form are the pairwise inner products of the data points. Thus the function we are trying to minimize is relatively simple.
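To make the quadratic structure concrete, here is a small sketch that evaluates the objective both directly and through the inner-product expansion in Equation 7.2 (the function names are just for illustration):

```python
import numpy as np

# A and B are (|A|, k) and (|B|, k) arrays of points; lam and mu are
# nonnegative weight vectors that each sum to one.
def objective_direct(A, B, lam, mu):
    p, q = lam @ A, mu @ B
    return float(np.dot(p - q, p - q))

def objective_gram(A, B, lam, mu):
    # The points enter only through their pairwise inner products (the Gram
    # matrices), which is the observation behind Equation 7.2.
    return float(lam @ (A @ A.T) @ lam
                 - 2 * lam @ (A @ B.T) @ mu
                 + mu @ (B @ B.T) @ mu)
```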
On the other hand, unlike optimization problems we have seen earlier in these lectures, in which we can apply Lagrange multipliers, in this case some of the constraints are inequalities – namely the requirement that all of the $\lambda_i$ and $\mu_j$ be nonnegative – rather than equalities. There is an extensive theory of such problems that derives from the idea of Lagrange multipliers. However, in these notes, we will not dive into that theory but will instead construct an algorithm for solving the problem directly.
Relaxing the constraints
Our first step in attacking this problem is to adjust our constraints and our objective function slightly so that the problem becomes easier to attack.
Optimization Problem 2: This is a slight revision of Problem 1 above. We minimize
$$\widetilde{Q}(\lambda,\mu) = \frac{1}{2}\Big\|\sum_{i}\lambda_i a_i - \sum_{j}\mu_j b_j\Big\|^2 - \sum_{i}\lambda_i - \sum_{j}\mu_j$$
subject to the constraints that all $\lambda_i\ge 0$, all $\mu_j\ge 0$, and
$$\sum_{i}\lambda_i = \sum_{j}\mu_j.$$
Problem 2 is like Problem 1, except we don’t require the sums of the $\lambda_i$ and $\mu_j$ to be one, but only that they be equal to each other; and we modify the objective function slightly. It turns out that the solution to this optimization problem easily yields the solution to our original one.
Lemma: Suppose $\lambda$ and $\mu$ satisfy the constraints of Problem 2 and yield the minimal value of the objective function $\widetilde{Q}$. Let $S = \sum_i\lambda_i = \sum_j\mu_j$. Then $S > 0$. Rescale the $\lambda_i$ and $\mu_j$ to have sum equal to one by dividing by $S$, yielding $\lambda' = \lambda/S$ and $\mu' = \mu/S$. Then $(\lambda',\mu')$ is a solution to Optimization Problem 1.
Proof: To show that $S > 0$, suppose instead that $S = 0$. Since the coefficients are nonnegative, this means $\lambda_i = 0$ and $\mu_j = 0$ for all $i$ and $j$, so that the value of $\widetilde{Q}$ is $0$. But pick any $\lambda^{*},\mu^{*}$ satisfying the constraints of Problem 1 and let $u = \sum_i\lambda^{*}_i a_i - \sum_j\mu^{*}_j b_j$. The one-variable quadratic function
$$g(t) = \widetilde{Q}(t\lambda^{*},t\mu^{*}) = \frac{t^2}{2}\|u\|^2 - 2t$$
takes its minimum value at $t = 2/\|u\|^2$ and its value at that point is negative. Therefore the minimum value of $\widetilde{Q}$ is negative, which means $S > 0$ at that minimum point.

For the equivalence, let $p' = \sum_i\lambda'_i a_i$ and $q' = \sum_j\mu'_j b_j$, and notice that for every $t\ge 0$ the coefficients $t\lambda', t\mu'$ still satisfy the constraints of Problem 2. Therefore
$$\widetilde{Q}(\lambda,\mu) = \widetilde{Q}(S\lambda',S\mu') \ \ge\ \min_{t\ge 0}\Big(\frac{t^2}{2}\|p' - q'\|^2 - 2t\Big) = -\frac{2}{\|p' - q'\|^2}.$$
On the other hand, suppose that $\lambda'',\mu''$ are a solution to Problem 1, with corresponding points $p'',q''$ at the minimal distance $d = \|p'' - q''\|$. Then $t\lambda'', t\mu''$ satisfy the constraints of Problem 2 for every $t\ge 0$, so
$$\widetilde{Q}(\lambda,\mu) \ \le\ \min_{t\ge 0}\Big(\frac{t^2}{2}d^2 - 2t\Big) = -\frac{2}{d^2},$$
and finally
$$-\frac{2}{d^2} \ \ge\ \widetilde{Q}(\lambda,\mu) \ \ge\ -\frac{2}{\|p' - q'\|^2},$$
so that $\|p' - q'\|\le d$. Since $d$ is the minimal value of $\|p - q\|$ over the convex hulls, we have $\|p' - q'\| = d$, so that $(\lambda',\mu')$ indeed gives a solution to Problem 1.
Sequential Minimal Optimization
Now we outline an algorithm for solving Problem 2, called Sequential Minimal Optimization (SMO), that was introduced by John Platt in 1998 (see [5] and Chapter 12 of [6]). The algorithm is based on the principle of gradient descent, where we exploit the fact that the negative gradient of a function points in the direction of its most rapid decrease and we take small steps in the direction of the negative gradient until we reach the minimum.

However, in this case we simplify this idea a little. Recall that the objective function is a quadratic function in the $\lambda_i$ and $\mu_j$, and that we need to preserve the condition that $\sum_i\lambda_i = \sum_j\mu_j$. So our approach is going to be to take, one at a time, a pair $\lambda_i$ and $\mu_j$ and change them together so that the equality of the sums is preserved and the change reduces the value of the objective function. Iterating this will take us to a minimum.
So, for example, let’s look at $\lambda_{i_0}$ and $\mu_{j_0}$ and, for the moment, think of all of the other coefficients as constants. Then our objective function reduces to a quadratic function of these two variables that looks something like:
$$c_1\lambda_{i_0}^2 + c_2\lambda_{i_0}\mu_{j_0} + c_3\mu_{j_0}^2 + c_4\lambda_{i_0} + c_5\mu_{j_0} + c_6.$$
The constraints that remain are $\lambda_{i_0}\ge 0$ and $\mu_{j_0}\ge 0$, and we are going to try to minimize $\widetilde{Q}$ by changing $\lambda_{i_0}$ and $\mu_{j_0}$ by the same amount $\delta$, which preserves the equality of the two sums. Furthermore, since we still must have $\lambda_{i_0}+\delta\ge 0$ and $\mu_{j_0}+\delta\ge 0$, we have
$$\delta \ \ge\ -\min(\lambda_{i_0},\mu_{j_0}). \tag{7.3}$$
In terms of this single variable $\delta$, our optimization problem becomes the job of finding the minimum of a quadratic polynomial in one variable subject to the constraint in Equation 7.3. This is easy! There are two cases: the critical point of the quadratic is to the left of $-\min(\lambda_{i_0},\mu_{j_0})$, in which case the constrained minimum occurs at $\delta = -\min(\lambda_{i_0},\mu_{j_0})$; or the critical point of the quadratic is to the right of $-\min(\lambda_{i_0},\mu_{j_0})$, in which case the minimum occurs at the critical point. This is illustrated in Figure 7.10.
Computationally, let’s write
$$p = \sum_{i}\lambda_i a_i,\qquad q = \sum_{j}\mu_j b_j,\qquad v = a_{i_0} - b_{j_0}.$$
Then, as a function of $\delta$,
$$g(\delta) = \frac{1}{2}\big\|(p - q) + \delta v\big\|^2 - \sum_{i}\lambda_i - \sum_{j}\mu_j - 2\delta,$$
and using the definition of $\widetilde{Q}$ we obtain
$$g'(\delta) = (p - q)\cdot v + \delta\|v\|^2 - 2.$$
Setting this derivative to zero gives the following formula for the critical value of $\delta$:
$$\delta^{*} = \frac{2 - (p - q)\cdot(a_{i_0} - b_{j_0})}{\|a_{i_0} - b_{j_0}\|^2}.$$
Using this information we can describe the SMO algorithm.
Algorithm (SMO, see [5]):
Given: Two linearly separable sets of points $A = \{a_1,\ldots,a_{|A|}\}$ and $B = \{b_1,\ldots,b_{|B|}\}$ in $\mathbb{R}^k$.

Find: Points $p$ and $q$ belonging to $C(A)$ and $C(B)$ respectively such that
$$\|p - q\| = \min_{p'\in C(A),\,q'\in C(B)}\|p' - q'\|.$$
Initialization: Set $\lambda_i = 1/|A|$ for $i = 1,\ldots,|A|$ and $\mu_j = 1/|B|$ for $j = 1,\ldots,|B|$. Set $p = \sum_i\lambda_i a_i$ and $q = \sum_j\mu_j b_j$. Notice that $\sum_i\lambda_i = \sum_j\mu_j = 1$. Let $S = \sum_i\lambda_i = \sum_j\mu_j$. These sums will remain equal to each other throughout the operation of the algorithm.
Repeat the following steps until the maximum value of $|\delta|$ computed in each iteration is smaller than some tolerance (so that the change in all of the $\lambda_i$ and $\mu_j$ is very small):
- For each pair $(i,j)$ with $1\le i\le |A|$ and $1\le j\le |B|$, compute
$$\delta^{*} = \frac{2 - (p - q)\cdot(a_i - b_j)}{\|a_i - b_j\|^2}$$
and
$$\delta_0 = -\min(\lambda_i,\mu_j).$$
If $\delta^{*}\ge\delta_0$ then set $\delta = \delta^{*}$; otherwise set $\delta = \delta_0$. Then update the variables by the equations:
$$\lambda_i \leftarrow \lambda_i + \delta,\qquad \mu_j\leftarrow\mu_j+\delta,\qquad p\leftarrow p + \delta a_i,\qquad q\leftarrow q + \delta b_j,\qquad S\leftarrow S+\delta.$$
When this algorithm finishes, the rescaled points $p/S$ and $q/S$ will be very good approximations to the desired closest points.
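Below is a minimal numpy sketch of this loop. It is not Platt's full SMO (there are no heuristics for choosing the pair and no caching), and it assumes the relaxed objective $\widetilde{Q}(\lambda,\mu)=\tfrac{1}{2}\|p-q\|^2-\sum_i\lambda_i-\sum_j\mu_j$ written above; the helper name `closest_hull_points` is just for illustration.

```python
import numpy as np

# Sketch of the pairwise-update loop for the relaxed problem. A and B are
# arrays whose rows are the points of the two linearly separable sets.
def closest_hull_points(A, B, tol=1e-8, max_sweeps=1000):
    A, B = np.asarray(A, dtype=float), np.asarray(B, dtype=float)
    lam = np.full(len(A), 1.0 / len(A))   # uniform weights, sum = 1
    mu = np.full(len(B), 1.0 / len(B))    # uniform weights, sum = 1
    p, q = lam @ A, mu @ B
    for _ in range(max_sweeps):
        biggest = 0.0
        for i in range(len(A)):
            for j in range(len(B)):
                v = A[i] - B[j]
                vv = v @ v
                if vv == 0.0:
                    continue
                # unconstrained critical point of the one-variable quadratic,
                # clamped so that lambda_i and mu_j stay nonnegative
                delta = (2.0 - (p - q) @ v) / vv
                delta = max(delta, -min(lam[i], mu[j]))
                lam[i] += delta
                mu[j] += delta
                p += delta * A[i]
                q += delta * B[j]
                biggest = max(biggest, abs(delta))
        if biggest < tol:
            break
    # rescale by the common sum so the weights are convex combinations
    return p / lam.sum(), q / mu.sum()
```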
Recall that if we now write $p$ and $q$ for these closest points and set $w = p - q$, then the optimal margin classifier is
$$f(x) = w\cdot x + b,$$
where $c^{+} = w\cdot p$, $c^{-} = w\cdot q$, and $b = -\dfrac{c^{+} + c^{-}}{2}$, so that the hyperplane $f(x) = 0$ runs down the middle of the margin. Since $w = p - q$ we can simplify this to obtain
$$f(x) = (p - q)\cdot x - \frac{\|p\|^2 - \|q\|^2}{2}.$$
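Continuing the sketch above (with the hypothetical `closest_hull_points` helper), the optimal margin classifier follows directly from this formula:

```python
import numpy as np

def optimal_margin_classifier(A, B):
    p, q = closest_hull_points(A, B)
    w = p - q
    b = -0.5 * (np.dot(p, p) - np.dot(q, q))  # hyperplane through the midpoint of p and q
    # classify x by the sign of w.x + b: +1 on the A side, -1 on the B side
    return lambda x: 1 if np.dot(w, np.asarray(x, dtype=float)) + b > 0 else -1
```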
In Figure 7.11, we show the result of applying this algorithm to the penguin data and illustrate the closest points as found by an implementation of the SMO algorithm, together with the optimal classifying line.
Bearing in mind that the $y$-axis is scaled by a factor of 200, we obtain the following rule for distinguishing between Adelie and Gentoo penguins: if your culmen depth and body mass put you above the red line, you are a Gentoo penguin; otherwise, you are an Adelie.
Inseparable Sets
Not surprisingly, real life is often more complicated than the penguin example we’ve discussed at length in these notes. In particular, sometimes we have to work with sets that are not linearly separable. Instead, we might have two point clouds, the bulk of which are separable, but because of some outliers there is no hyperplane we can draw that puts all of one set on one side and all of the other set on the other.
Fortunately, all is not lost. There are two common ways to address this problem, and while we won’t take the time to develop the theory behind them, we can at least outline how they work.
Best Separating Hyperplanes
If our sets are not linearly separable, then their convex hulls overlap and so our technique for finding the closest points of the convex hulls won’t work. In this case, we can “shrink” the convex hulls by considering only the combinations of points
$$\sum_{i}\lambda_i x_i,\qquad \sum_{i}\lambda_i = 1,\qquad 0\le\lambda_i\le D$$
for some $D < 1$. For small enough $D$, these reduced convex hulls will be linearly separable – although some outlier points from each class will lie outside of them – and we can find a hyperplane that separates the reduced hulls.

In practice, this means we allow a few points to lie on the “wrong side” of the hyperplane. Our tolerance for these mistakes depends on $D$, but we can include $D$ in the optimization problem to try to find the smallest $D$ that “works”.
Nonlinear kernels
The second option is to look not for separating hyperplanes but instead for separating curves – perhaps polynomials or even more exotic curves. This can be achieved by taking advantage of the form of Equation 7.2. As you see there, the only way the points enter into the function being minimized is through the inner products $a_i\cdot a_{i'}$, $a_i\cdot b_j$, and $b_j\cdot b_{j'}$. We can adopt a different inner product than the usual Euclidean one, and reconsider the problem using this different inner product. This amounts to embedding our points in a higher dimensional space where they are more likely to be linearly separable. Again, we will not pursue the mathematics of this further in these notes.
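As a small illustration of the idea (not a full treatment of kernel methods), one can replace each Euclidean inner product in Equation 7.2 with a kernel function; the sketch below does this with a polynomial kernel, mirroring the hypothetical `objective_gram` sketch from earlier.

```python
import numpy as np

# A polynomial kernel standing in for the Euclidean inner product.
def poly_kernel(X, Y, degree=2):
    return (X @ Y.T + 1.0) ** degree

# The objective of Equation 7.2 with every inner product replaced by k(x, y),
# i.e. the same computation carried out in the feature space the kernel defines.
def kernel_objective(A, B, lam, mu, kernel=poly_kernel):
    return float(lam @ kernel(A, A) @ lam
                 - 2 * lam @ kernel(A, B) @ mu
                 + mu @ kernel(B, B) @ mu)
```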
Exercises
1. Prove that, if $w\cdot x + b = 0$ is a hyperplane in $\mathbb{R}^k$, then the two “sides” of this hyperplane, consisting of the points where $w\cdot x + b\ge 0$ and the points where $w\cdot x + b\le 0$, are both convex sets.
2. Prove that $C(S)$ is the intersection of all the halfplanes $w\cdot x + b\ge 0$ as $w\cdot x + b = 0$ runs through all supporting hyperplanes for $C(S)$ satisfying $w\cdot x + b\ge 0$ on $C(S)$.
3. Prove that $C(S)$ is bounded. Hint: show that $S$ is contained in a ball of sufficiently large radius centered at zero, and then that $C(S)$ is contained in that ball as well.
4. Confirm the final formula for the optimal margin classifier at the end of the lecture.
[2] Bennett, K. P. and Bredensteiner, E. J. (2000). Duality and geometry in SVM classifiers. In Proceedings of the Seventeenth International Conference on Machine Learning (P. Langley, ed.). Morgan Kaufmann Publishers.

[6] Schölkopf, B., Burges, C. and Smola, A. (1998). Advances in Kernel Methods: Support Vector Learning. MIT Press.