Maloney Home Page  |  Support vector regression applied to Raman spectra (part 1)

Context

These notes are intended for the Simmons Mechanobiology Lab (and the group members working on the project “Measurement of Contractility-induced Residual Stress for Tissue Engineering” directed by Prof. Sarntinoranont) and their exploration of machine learning to relate Raman spectra to compliant substratum strain states.

Part 1 of these notes covers classification of linearly separable groups by maximizing the margin, transformation to achieve linear separability, Hesse normal form, the implementation of QP optimization, and equivalence to convex hulls. Part 2 will cover the transition from the primal to the dual formulation, the kernel trick, slack variables, cross validation, the transition from support vector classification to support vector regression, and the incorporation of principal component analysis.

Our problem is as follows: If the following Raman spectra correspond to 0% and 40% uniaxial strain of polyacrylamide:

(to be inserted following publication, or make up a dummy plot)

then what strain does the following spectrum correspond to?

(to be inserted following publication, or make up a dummy plot)

Potentially, sophisticated learning and prediction techniques will ultimately be useful or necessary for this project. In the following content, I've tried to turn my notes—a beginner's notes—into a tutorial relevant to the group. These notes also serve as an introduction to support vector machines for engineers getting up to speed in machine learning.

Strategy

My first aim is to understand and clarify the ample jargon (e.g., what is a “maximum margin linear classifier”?) and general concepts of the field rather than to dive directly into the deep mathematical machinery of machine learning.

Many textbooks on classification, for example, immediately introduce expressions such as $$(\boldsymbol{x}_1,y_1),\ldots,(\boldsymbol{x}_m,y_m)\in\mathcal{X}\times\{\pm 1\},$$ which is easily interpreted by mathematicians as a compact way of saying that we have a bunch of N-dimensional input observations \(\boldsymbol{x}_m\) and corresponding output labels \(y_m\) and that each pair is assigned to either the -1 or +1 group.

However, too many such expressions might make our eyes glaze over at first attack. Therefore, these notes focus on defining terms based on our specific interests and on presenting intuitive and straightforward illustrations and reasoning to cover key concepts.

For example, we might visually represent the preceding mathematical expression as follows:


1D example classification model with two groups: red triangles (corresponding to a label of -1) and blue squares (corresponding to a label of +1). How would you separate these groups? (Plotted in Python with this code.)

Here, the individual points are each located by a scalar (as a simplification of the general vector \(\boldsymbol{x_m}\) that simply points in the \(x\) direction) and where the red triangles correspond to -1 and the blue squares to +1 (representing the output labels \(y_m\)).

Specific motivation

Why would we choose to analyze Raman spectra with a support vector machine (probably preceded by principal component analysis, PCA)? I focus on these topics because of the multiple associated literature reports of successful implementation; one example is Chengxu Hu et al., “Raman spectra exploring breast tissues: Comparison of principal component analysis and support vector machine-recursive feature elimination,” Medical Physics 40.6 (2013). In addition, a quick look at the machine learning literature suggests that PCA is well suited to reducing spectra in which several features (such as peaks) change in concert into simpler forms and that support vector machines are powerful and may require less fine-tuning than other methods (to do: make this point more reasonable with examples).

Definition of key terms

Let's first identify what flavor of machine learning interests us.

  • We can broadly divide inference efforts between classification and regression.

    For a new data group of interest, classification (left plot below) involves predicting the appropriate associated discrete group (such as whether a certain Raman spectrum was obtained from diseased or healthy tissue); in contrast, regression (right plot below) involves predicting an appropriate associated continuous value (such as the strain state corresponding to a particular Raman spectrum).

    (Left) 2D example of classification: find the best division (gray line) between the red triangles and blue squares. (Right) 2D example of regression: find the best fit (gray line) to the red triangles.

    Since we're interested in predicting quantitative strain values from a Raman spectrum, we're interested in regression. Most machine learning texts address classification first, then regression; in these notes as well, I'll start with classification and continue to refer to relevant classification scenarios when they provide a useful context to define concepts and terms.

  • In addition, let's note that we're engaged in supervised learning: each data input (and inputs are also sometimes called instances, cases, patterns or observations) matches a data output (and outputs are sometimes called labels, targets, or—here also—observations).

    In the plots above, the input and output for the classification data are the 2D position and the classification of –1/+1, respectively; the input and output for the regression data are the \(x\) and \(y\) coordinates, respectively. In our actual problem, of course, the input is a Raman spectrum, and the output is a strain state such as the magnitude of uniaxial stretch.

    (Unsupervised learning, in contrast, doesn't involve output data; one example is clustering.)

  • For both classification and regression tasks, it's often the case that we have input-output pairs that are already known; in our case, these pairs are the applied strain states and corresponding collected Raman spectra. These data are called the training data.

    (Actually, for quality control and to optimize the system, we might take just a portion of these data as the training data, using the rest as validation data to test our learning process when we still know the correct output “answer.”)

    In any case, sometimes we'll know the output (and we'll use this information to refine our classification or regression process), and sometimes we won't (and we'll want to use our refined process to infer new conclusions). The latter case (and one of the key deliverables of this research project) involves recording Raman spectra via optically interrogating regions around contractile cells on synthetic compliant substrata or in tissue.

  • Finally, we're performing batch learning: in the learning process, all data are supplied to the algorithm at once; the learning does not occur in stages (at least that's not what we anticipate at this stage).

These terms help us put our desired analysis method in context, which is particularly valuable when surveying the literature: we seek a regression method that implements supervised batch learning. We anticipate providing training data and validation data in the form of input-output pairs.

1D classification example: Intuitive approach

To launch our machine learning adventure, let's return to our simple classification problem displayed above. In this example, single scalar values are plotted in 1D, and the actual classes are non-overlapping (i.e., we can classify all points correctly using a single division, with no error; another term for this feature is linear separability). Here, we wish to divide the group of red triangles (lying at \(x\) coordinates 2 and 3) from the group of blue squares (lying at coordinates 6, 6.5, and 8.3):


Our 1D example classification model, where the red triangle group can be perfectly separated from the blue square group by a single threshold value. Candidate divisions are A, B, and C; perhaps we intuit that B, lying at the midpoint of the smallest distance separating the two groups—the middle of the “street”—is the optimal choice.

Well, we could assign a division at A, B, or C with absolute success; out of these options, B is special because it lies at the midpoint of the gap and therefore maximizes the margin on either side. That is, if future data that we encountered were to include data closer to the opposing group, then a centered division point would probably be best situated to accommodate this variation:


(a)

(b)

Darker gray circles around each input point represent the maximum potential error that can be accommodated by a certain division. Division A is less accommodating (and therefore more likely to misclassify additional data) than division B; equivalently, we say that B is better at maximizing the margins.

The result of this intuition-based strategy is called the maximum margin linear classifier, just to give an example of the terms that seem awfully opaque until we develop them in easy-to-understand stages.

1D classification example: Developing a systematic approach

For our 1D data, let's call the specific value of the selected division point (equivalently, the threshold or boundary condition) \(a\). Using our intuition, we chose a threshold of \(a=4.5\) (i.e., line B, which lies midway between points at 3 and 6), representing the midpoint between the farthest-right red triangle and the farthest-left blue square, which are circled below.

We arrive at an important point: the position of the maximum margin linear classifier depends only on those circled points (vectors in the N-dimensional case). (The positions of the other points/vectors aren't important except inasmuch as they are correctly classified.) Because these key vectors provide the foundation for our classification procedure—they support classification—they are termed support vectors:


When we divide non-overlapping classes, only the inputs nearest the division point are relevant and are designated support vectors because of their importance. Other inputs—as long as they are correctly classified—have no influence on the process of selecting a division point.

Wow, we can now consider ourselves to be performing support vector classification with very little conceptual burden. The key point to remember is that the support vectors predominate in determining the position of the separating classifier.

Now, one way to classify a new data point \(x\) would be to check whether it's greater than or less than \(a\): do we have \(x\lt a\), or do we have \(x\gt a\)?

But an alternate approach, well suited for classifications comprising –1 and +1, would be to simply calculate the sign of \(x-a\), or \(\mathrm{sgn}(x-a)\). Note that this approach gives the classification (of –1 or +1) automatically, which is convenient. In fact, the sign of any multiple of \((x-a)\) would work, corresponding to different lines (hyperplanes in the N-dimensional case) passing through the division point. We could write the hyperplane equation as \(y=w(x-a)\), for example, for positive \(w\):


The sign function facilitates automatic classification of input values. Shown in gray are three candidate hyperplanes that correctly classify the two groups in this way; the dashed black line shows their common predictive output.
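To make this classification rule concrete, here's a tiny illustrative snippet (my own toy code, not the plotting code linked above) that applies the sign function to the five training points using division B, the midpoint of the gap at \(a=4.5\), and an arbitrary positive \(w\):

    import numpy as np

    # 1D training inputs and their class labels, as plotted above.
    x = np.array([2.0, 3.0, 6.0, 6.5, 8.3])
    y = np.array([-1, -1, 1, 1, 1])

    a = 4.5   # division B, the midpoint of the gap between 3 and 6
    w = 1.0   # any positive multiple of (x - a) gives the same signs

    predictions = np.sign(w * (x - a))
    print(predictions)                  # [-1. -1.  1.  1.  1.]
    print(np.all(predictions == y))     # True: every point is classified correctly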

Notably, only one hyperplane passes through the support vectors; this special hyperplane is called the canonical or optimal hyperplane:


The canonical hyperplane (black line) satisfies the correct classification result of –1 and +1 exactly at the support vectors. We seek a way to determine the details of this hyperplane automatically.

Up to this point, we've used intuition to maximize the margin between groups. Let's now try working backward: taking a clue from the canonical hyperplane shown above, let's require that our separator satisfies \(w(x_m-a)\le -1\) for every red triangle (classification \(y_m=-1\)) and \(w(x_m-a)\ge 1\) for every blue square (classification \(y_m=1\)). Hey, wait a second, it's more efficient to write simply \(w(x_m-a)y_m\ge 1\), which applies to all points. So that's clever.

(Here, \(x_m\) and \(y_m\) refer to the specific data points; numbering them from left to right, we'd designate the red triangles as \(x_1\) and \(x_2\) and the blue squares as \(x_3\), \(x_4\), and \(x_5\), so that the input data are \((x_m,y_m)=((2,-1),(3,-1),(6,1),(6.5,1),(8.3,1))\). Clear?)

Check out which hyperplanes satisfy this requirement for divisions A, B, and C:

    
Hyperplanes for divisions A, B, and C. A systematic way to obtain the canonical hyperplane (shown in black) is to ask which hyperplane exhibits the minimum slope while satisfying the necessary classification criterion \(w(x_m-a)y_m\ge 1\). (Plotted in Python with this code.)

Expressed in words, we're requiring not only that the sign of the hyperplane predicts the correct classification but also that the hyperplane never undercuts the support vectors. For example, the hyperplane corresponding to division A is constrained to pass through the far-right red triangle, and the hyperplane corresponding to division C is constrained to pass through the far-left blue square. Note that the slope of the hyperplane is lower when it passes through the optimal division B than when it passes through the inferior divisions at A and C; furthermore, note that the B hyperplane passes through both support vectors.

So we've found (in 1D for the linearly separable case, at least) that minimizing \(w\) while satisfying \(w(x_m-a)y_m\ge 1\) works as a rigorous way to find the maximum margin linear classifier. It turns out that this aspect still holds when we extend the situation to 2D and higher: we'll still seek to minimize the effective slope \(w\) of the hyperplane while never undercutting a support vector.
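To see this optimization in action, here's a brute-force sketch (my own toy code; the grid of candidate thresholds is arbitrary) that scans candidate values of \(a\) and, for each feasible one, records the smallest \(w\) satisfying \(w(x_m-a)y_m\ge 1\):

    import numpy as np

    x = np.array([2.0, 3.0, 6.0, 6.5, 8.3])
    y = np.array([-1, -1, 1, 1, 1])

    best = None
    for a in np.linspace(0.0, 10.0, 2001):       # candidate division points
        margins = (x - a) * y                    # (x_m - a) y_m for every point
        if np.min(margins) <= 0:                 # a point is misclassified: infeasible
            continue
        w = 1.0 / np.min(margins)                # smallest w with w (x_m - a) y_m >= 1
        if best is None or w < best[0]:
            best = (w, a)

    print(best)   # roughly (0.667, 4.5): the canonical hyperplane through the support vectors

The winning threshold is indeed \(a=4.5\) (division B), with the hyperplane passing through the support vectors at 3 and 6.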

Let's now consider the complication of input points unavoidably lying on the “wrong side” of the division or threshold, which will force us to extend the 1D case to higher dimensions.

Accommodating overlapping classes through transformation

We've seen that non-overlapping classes (equivalently, linearly separable classes) in 1D are relatively straightforward to work with because the general position of the divider is intuitive; perfect classification is assured. But what if the classes are overlapping?

For overlapping classes, we have at least two approaches at our disposal: we can transform the data to be non-overlapping, or we can accommodate the misclassification through so-called slack variables (to be covered in part 2).

As an example of transformation, consider the following input data, which extends our existing data set by adding another red triangle on the right at \(x=10.5\):


An additional data point on the “wrong side” removes the advantage of linear separability; we need to look for new solutions to classify the data using a single boundary.

The presence of this new data point in the “–1” class precludes error-free classification from a simple single threshold. Notably, however, we could plot the data in a higher-dimensional space in which the \(x\)-axis still plots \(x\) and a second axis plots, say, \((x-7)^2\):


To separate the –1 and +1 classes, we could transform our original scalar \(x\) data to the corresponding \(\phi(x)\) data that are actually separable. This process is as simple as adding a second axis that plots \((x-7)^2\). Gray lines represent potential hyperplanes, or decision criteria, that now correctly classify the linearly separable data. The choice of hyperplane affects the classification of a potential future data point ×.

Let's call this transformation \(\phi\); whereas we once plotted \(x_m\) on a 1D \(x\)-axis, we now plot the vectors \(\boldsymbol{x_m}=\phi(x_m)=\left(x_m,(x_m-7)^2\right)\) on a 2D plane. Linear separability returns!
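In code, this particular transformation is essentially a one-liner (the function name phi is my own):

    import numpy as np

    def phi(x):
        # Map each 1D input x to the 2D point (x, (x - 7)**2), as plotted above.
        x = np.asarray(x, dtype=float)
        return np.column_stack((x, (x - 7.0) ** 2))

    x = np.array([2.0, 3.0, 6.0, 6.5, 8.3, 10.5])
    print(phi(x))
    # rows: (2, 25), (3, 16), (6, 1), (6.5, 0.25), (8.3, 1.69), (10.5, 12.25)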

This technique is remarkable, no? It suggests that with a suitably customized transformation, we could accommodate any additional data points that would hinder making a decision in the original dimensional space. (For the 1D case at least, this process would be as simple as applying some type of polynomial transformation and plotting the results on the second axis. For the N-dimensional case, we could imagine using some sophisticated function, as required.)

The process of specifying such a transformation has been likened to having the data points distributed across a bedsheet and flicking the sheet in such a way that the different classes are successfully separated. And one could imagine that a sufficiently complex flicking process could separate any collection of intermixed data. (But is our transformation needlessly complex to accommodate the details of the training data, without necessarily fitting subsequent data successfully? This will be a topic of interest later when we discuss overfitting.)

Of course, it's true we don't know yet what slope and position we should use for the optimal decision line or hyperplane (except inasmuch as we can expect that we should maximize the margin, as before). The choice of hyperplane has consequences; our decision would govern how we classify the new point shown as \(\mathsf{x}\) above, for example. Here are two ways to visualize the margin provided by various slopes and positions of sample hyperplanes:

Classifier margin for various hyperplane slopes and positions. Left: minimum margin shown as a circular region of safety for the point most susceptible to misclassification. Right: minimum margin shown as a bound on either side of the hyperplane. Larger margins are always preferred.

In 2D or higher, we could start by selecting the midpoint between the two closest points (which worked well in the 1D case), but it's not necessarily clear yet how to set the slope to avoid passing close to—or much worse, misclassifying—other points. So we'll need some mathematical guidance for choosing the best hyperplane parameters, and this guidance will involve extending our optimization rules. This aspect is discussed in the next section.

Formulating the classification problem in 2D and higher

Let's take a moment to summarize. Hopefully both the intuitive approach (maximize the margin) and the corresponding optimization algorithm (minimize \(w\) while satisfying \(w(x_m-a)y_m\ge 1\)) appear straightforward for non-overlapping points on a line. To generalize this approach to groups of vectors in N-dimensional space, we first need to review some geometrical concepts.

Geometrical considerations, including the introduction of Hesse form

This discussion of how a support vector machine operates is necessarily going to include some geometrical considerations; for example, we'll need to address the distance between a point and a hyperplane in N-dimensional space.

We're likely familiar with expressing a line in the form of \(y=mx+c\) (or whatever parameters we're used to using). Here, \(x\) and \(y\) are plotted on the horizontal and vertical axes, respectively, \(m\) is the slope, or rise over run, and \(c\) is the \(y\) intercept. In many plots we see, the \(x\)-axis presents the independent variable (i.e., the one we can change), and the \(y\)-axis presents the dependent variable (i.e., the one we wish to understand). In materials science, for example, we might plot an alloy's yield strength (\(y\)-axis) as a function of its temperature (\(x\)-axis).

Now consider our classification problem with points in 2D space and a line of our choosing that maximizes the margin. Each set of points (which can also be thought of as vectors extending from the origin) now represents a set of input data to be classified in terms of a certain output. For example, we might wish to consider the temperature and pressure before classifying a solid-state phase as thermodynamically stable or not.

With the 2D points present in a classification problem, we don't really have dependent and independent axes; in our thermodynamic stability example, for instance, we should be able to swap the temperature and pressure axes without affecting the results. So instead of \(x\) and \(y\), let's use \(x_1\) and \(x_2\), respectively. And let's bring the variables over to one side to obtain the line equation \(mx_1-x_2+c=0\). This form is more symmetric and therefore more appropriate for our N-dimensional inputs that all have equal importance a priori.

Notably, we can write this equation as \(\boldsymbol{w}\cdot\boldsymbol{x}+c=0\), where \(\boldsymbol{w}=(m,-1)\), where \(\boldsymbol{x}\) is the vector pointing from the origin to a point, and where we've employed the dot product or inner product. This form is known as Hesse form.

Equivalently, we could write \(\boldsymbol{w}^\mathrm{T}\boldsymbol{x}+c=0\), where vectors are expressed in column form and \(\mathrm{T}\) represents the transpose. Still another representation is \(\langle\boldsymbol{w},\boldsymbol{x}\rangle+c=0\), where \(\langle\cdot{,}\cdot\rangle\) is called the inner product or scalar product.

The reverse is also possible: given a Hesse form \(\boldsymbol{w}\cdot\boldsymbol{x}+b=0\) or (in 2D) \(w_1x_1+w_2x_2+b=0\), we can easily determine that \(x_2=-\frac{w_1}{w_2}x_1-\frac{b}{w_2}\), which if plotted on an \(x_1\) vs. \(x_2\) Cartesian space corresponds to a line with slope \(m=-\frac{w_1}{w_2}\) and \(y\) intercept \(c=-\frac{b}{w_2}\). OK.
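As a quick worked example with arbitrarily chosen numbers: the Hesse form $$2x_1+4x_2-8=0$$ has \(\boldsymbol{w}=(2,4)\) and \(b=-8\), corresponding to slope \(m=-\frac{2}{4}=-\frac{1}{2}\) and \(y\) intercept \(c=-\frac{-8}{4}=2\); that is, the familiar line \(x_2=-\frac{1}{2}x_1+2\).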

Now, I should say at this point that I find this shift to Hesse form very disconcerting regardless of how casually it's applied in support vector machine texts. Most texts and tutorials, even at the beginner level, simply state something like "We seek a hyperplane equation with input weights \(\boldsymbol{w}\) [these components are often called weights] and bias \(b\) that satisfies…" It's easy to get stuck on such framing because the Hesse form involves a locus of points composing the hyperplane. Here, we're no longer calculating a certain \(y\) from a given \(x\) (which is pretty straightforward) but rather seeking some amorphous group of points that satisfies the condition of the left-hand side equalling zero. So why is it that we use this formulation? Here are some of the key reasons:

  • With all variables brought to one side, we attain symmetry; no one type of input is privileged over another.

  • In addition, the dot product lets us apply all the machinery of linear algebra to an arbitrary number of inputs, expressed as \(\boldsymbol{x_m}\), each with a corresponding classification \(y_m\).

  • Another neat aspect is that \(\boldsymbol{w}\) is normal (i.e., perpendicular) to our hyperplane. This condition is useful because we'll be relying strongly on perpendicular distances when maximizing margins.

  • Algebraic expressions are readily available for calculating distances in terms of normal vectors to a plane. For example, it's useful to know that the distance \(d\) between a point \(\boldsymbol{x_m}\) and a hyperplane \(\boldsymbol{w}\cdot\boldsymbol{x}+b=0\) is \(d=\frac{|\boldsymbol{w}\cdot\boldsymbol{x_m}+b|}{||\boldsymbol{w}||}\), where \(||\boldsymbol{w}||=\sqrt{\boldsymbol{w}\cdot\boldsymbol{w}}=\sqrt{\boldsymbol{w}^\mathrm{T}\boldsymbol{w}}=\sqrt{\langle\boldsymbol{w},\boldsymbol{w}\rangle}\) is the length—also called the norm—of \(\boldsymbol{w}\).

  • Similar to the 1D case where we sought to satisfy \(w(x_m-a)y_m\ge 1\), we now seek to satisfy \((\boldsymbol{w}\cdot\boldsymbol{x_m}+b)y_m\ge 1\). We might compress this expression further by defining the augmented hyperplane and input vectors \(\boldsymbol{\hat{w}}=(b,w_1,w_2,\dots,w_N)\) and \(\boldsymbol{\hat{x}_m}=(1,x_{m,1},x_{m,2},\dots,x_{m,N})\), respectively, prepending \(b\) to \(\boldsymbol{\hat{w}}\) and \(1\) to \(\boldsymbol{\hat{x}_m}\) to give \((\boldsymbol{\hat{w}}\cdot\boldsymbol{\hat{x}_m})y_m\ge 1\) as a compact criterion for correct classification. (A quick numerical check of the distance formula and this compact criterion appears just after this list.)

These reasons provide ample motivation for getting used to Hesse form, dot products, and the machinery of linear algebra!
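Here's the promised numerical check of the distance formula and the compact classification criterion (the hyperplane parameters below are chosen purely for illustration; they're not the optimal ones we'll compute later):

    import numpy as np

    # Illustrative (non-optimal) hyperplane parameters.
    w = np.array([-0.05, -0.20])
    b = 1.8

    # Two of the transformed input points and their labels (see the next figure).
    x2, y2 = np.array([3.0, 16.0]), -1     # second red triangle
    x6, y6 = np.array([10.5, 12.25]), -1   # the added red triangle

    def distance(x, w, b):
        # Perpendicular distance from point x to the hyperplane w.x + b = 0.
        return abs(np.dot(w, x) + b) / np.linalg.norm(w)

    print(distance(x2, w, b), distance(x6, w, b))

    # The same classification criterion written with augmented vectors:
    w_hat = np.concatenate(([b], w))         # (b, w1, w2)
    x2_hat = np.concatenate(([1.0], x2))     # (1, x1, x2)
    print(np.dot(w_hat, x2_hat) * y2 >= 1)   # True if x2 is correctly classified with margin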

We now generalize to the 2D case and beyond. As described for the 1D case, we require the hyperplane to not undercut any support vectors; therefore, we require \((\boldsymbol{w}\cdot\boldsymbol{x_m}+b)y_m\ge 1\). In addition, we seek to maximize the margin. In the 1D case, this aspect was as simple as minimizing the slope \(w\) of the line that acted as the hyperplane. What about the case of higher dimensions? Well, we already have a formula for the distance from a point to a hyperplane, so let's maximize the distances from the hyperplane \(\boldsymbol{w}\cdot\boldsymbol{x_m}+b=0\) to the points to be classified.

Using the distance formula introduced in the box above, we can say that we wish to maximize $$d=\frac{|\boldsymbol{w}\cdot\boldsymbol{x_m}+b|}{||\boldsymbol{w}||}.$$

Shown below is the distance for \(\boldsymbol{x_2}\), for example:


Distance between a point (here, the second point in the –1 class, described by \(\boldsymbol{x_2}\)) and a hyperplane (shown in gray and characterized by its normal vector \(\boldsymbol{w}\)). Note that when the axes are scaled differently (equivalently, when the plotting aspect ratio does not equal one), orthogonal vectors don't appear perpendicular; nevertheless, \(\boldsymbol{w}\) and the distance measurement are definitely orthogonal to the hyperplane.

More precisely, we wish to maximize the minimum distance across all the data points for each classification. In this case, we see that \(d\) for \(\boldsymbol{x_2}\) is larger than that for \(\boldsymbol{x_6}\); therefore, point \(\boldsymbol{x_6}\) is more important for this particular hyperplane slope and intercept. Of course, for a different slope and intercept, point \(\boldsymbol{x_2}\) might predominate. It's this type of iterative decision making that we seek to automate and delegate to the computer to implement in a fraction of a second.

(A side note: At the risk of causing confusion, this figure continues the trend of scaling the \(y\)-axis to conveniently plot all the points; as a result, orthogonal entities (such as the hyperplane and its normal vector \(\boldsymbol{w}\)) don't appear perpendicular. This weirdness tripped me up for several hours as I worked on these figures until I remembered that scaling an axis keeps parallel lines parallel but does not necessarily keep perpendicular lines perpendicular. I'm sticking with this presentation to emphasize this aspect to the reader—after all, we'll frequently encounter or employ plots with a plotting aspect ratio other than one, and we shouldn't be stymied by this result. The weight vector \(\boldsymbol{w}\) is, as always, orthogonal to the hyperplane and parallel to the distance over which \(d\) is measured.)

Back to the distance calculation. We already know that at the support vectors, the classification inequality turns into the equality \(|\boldsymbol{w}\cdot\boldsymbol{x_m}+b|=1\), as in the 1D case where the optimal (black) line predicted exactly –1 or 1 at the support vectors. Therefore, the numerator simplifies to 1 at the support vectors (which have yet to be determined rigorously), and we seek simply to maximize \(\frac{2}{||\boldsymbol{w}||}\) (the factor of 2 arises because the margin is \(\frac{1}{||\boldsymbol{w}||}\) on either side of the hyperplane) or equivalently to minimize \(\frac{1}{2}||\boldsymbol{w}||\) or half the length of the vector \(\boldsymbol{w}\), which is very similar to minimizing the scalar \(w\) in the 1D case. The outcome of this constrained optimization, as we hoped to achieve, is the canonical hyperplane with maximum margin.

Reiterating, our problem can be stated as follows: $$\mathrm{minimize~}\frac{1}{2}||\boldsymbol{w}||=\frac{1}{2}\sqrt{\boldsymbol{w}^\mathrm{T}\boldsymbol{w}}$$ $$\mathrm{subject~to~}(\boldsymbol{w}\cdot\boldsymbol{x_m}+b)y_m\ge 1\mathrm{~(or,~equivalently,~}(\boldsymbol{\hat{w}}\cdot\boldsymbol{\hat{x}_m})y_m\ge 1\mathrm{)}.$$

This is a quadratic programming (QP) optimization problem, for which multiple open-source solvers are available. (Because the problem is convex, we advantageously avoid the possibility of becoming trapped in non-global minima. The operational details of such solvers are outside the scope of this note.) For example, the qpsolvers package in Python, implemented as linked below, can accommodate this problem. This package solves problems of the following form: $$\mathrm{minimize~}\frac{1}{2}\boldsymbol{u}^\mathrm{T}\boldsymbol{Pu}+\boldsymbol{q}^\mathrm{T}\boldsymbol{u},$$ $$\mathrm{subject~to~}\boldsymbol{Gu}\le\boldsymbol{h}.$$

Careful inspection reveals that if we take \(\boldsymbol{u}\) as \(\boldsymbol{\hat{w}}\), then \(\boldsymbol{P}\) should be the identity matrix with the (1,1) element set to zero (to remove \(b\) from the objective), that \(\boldsymbol{q}=0\), that the signs of \(\boldsymbol{G}\) and \(\boldsymbol{h}\) need to be flipped because the inequality is the reverse of what we want, and that we must carefully assemble \(\boldsymbol{G}\) from our knowledge of \(\boldsymbol{x_m}\) (where the input data were transformed in this case, so \(\boldsymbol{x_m}=\phi(x_m)\)) and \(y_m\). The coefficient of \(\frac{1}{2}\) and the square root don't affect where the minimum occurs, so they can safely be dropped. That's all there is to it!

Recall that our input data are $$(x_m,y_m)=((2,-1),(3,-1),(6,1),(6.5,1),(8.3,1),(10.5,-1)),$$ which we transformed to $$(\boldsymbol{x_m},y_m)=(((2,25),-1),((3,16),-1),((6,1),1),\\((6.5,0.25),1),((8.3,1.69),1),((10.5,12.25),-1))$$ to achieve linear separability. In equation form, we therefore have

$$\boldsymbol{\hat{w}}=\begin{bmatrix}b \\ w_1\\w_2\end{bmatrix},$$

$$\boldsymbol{P}=\begin{bmatrix}0 & 0 & 0\\0 & 1 & 0\\0 & 0 & 1\end{bmatrix},$$

$$\boldsymbol{q}=\begin{bmatrix}0 \\0 \\0 \end{bmatrix},$$

$$\boldsymbol{G}=-\begin{bmatrix}-1 & -2 & -25 \\ -1 & -3 & -16 \\1 & 6 & 1 \\1 & 6.5 & 0.25\\ 1 & 8.3 & 1.69 \\ -1 & -10.5 & -12.25\end{bmatrix},$$

$$\boldsymbol{h}=-\begin{bmatrix}1 \\1 \\1\\1\\1\\1 \end{bmatrix}.$$

Take care to understand how matrix \(\boldsymbol{G}\) is constructed to satisfy the inequality constraint.
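Here is a minimal sketch of how these matrices might be assembled and handed to an off-the-shelf solver. (I'm assuming the qpsolvers package with a backend such as cvxopt that accepts the positive-semidefinite \(\boldsymbol{P}\) above; the code linked below may differ in its details.)

    import numpy as np
    from qpsolvers import solve_qp

    # Transformed inputs phi(x_m) = (x_m, (x_m - 7)**2) and labels y_m.
    X = np.array([[2.0, 25.0], [3.0, 16.0], [6.0, 1.0],
                  [6.5, 0.25], [8.3, 1.69], [10.5, 12.25]])
    y = np.array([-1.0, -1.0, 1.0, 1.0, 1.0, -1.0])

    # Augmented inputs (1, x1, x2), so that the unknown is u = (b, w1, w2).
    X_hat = np.hstack((np.ones((len(y), 1)), X))

    P = np.diag([0.0, 1.0, 1.0])      # penalizes w1 and w2 only; b is unpenalized
    q = np.zeros(3)
    G = -(y[:, None] * X_hat)         # rows: -y_m * (1, x_m1, x_m2)
    h = -np.ones(len(y))              # so that G u <= h encodes (w . x_m + b) y_m >= 1

    u = solve_qp(P, q, G, h, solver="cvxopt")   # u = (b, w1, w2)
    print(u)   # should be close to the solution reported below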

Implementing this code in Python, we obtain a solution of \(\boldsymbol{\hat{w}}=[b,\,w_1,\,w_2]=[1.64,\,-0.038,\,-0.182]\), corresponding to the following hyperplane:


Optimal hyperplane for our transformed data as calculated via QP optimization. The margin on either side of the line is \(1/||\boldsymbol{w}||\) (here, 5.39); no larger value is possible. The support vectors are circled; they're the points that exactly satisfy \((\boldsymbol{\hat{w}}\cdot\boldsymbol{\hat{x}_m})y_m=1\). The dashed line and normal vector \(\boldsymbol{w}\) are both orthogonal to the hyperplane (but not perpendicular in this plot because of axis scaling).

Calculating \((\boldsymbol{\hat{w}}\cdot\boldsymbol{\hat{x}_m})y_m\) for each vector \(\boldsymbol{x_m}\), we find that \(\boldsymbol{x_5}\) and \(\boldsymbol{x_6}\) give outputs of exactly 1, therefore identifying them as the support vectors, shown encircled in the following animation:


Animated view of the optimal hyperplane, which meets the support vectors at the classification results of \(y=-1\) and \(y=1\). (The \(\phi\) notation is omitted for clarity, but don't forget that the original \(x_m\) points were \(\phi\)-transformed to achieve linear separability.) To do: use black for the hyperplane.

If we were to undo our transformation from 2D back to 1D, we'd find that the hyperplane reduces to two division points that perfectly classify our input data:


A single linear separator in 2D reduces in this case to two boundaries that correctly classify overlapping 1D data. To do: use en dash for the class.
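For the record, those two division points can be recovered by substituting the transformation back into the hyperplane equation: setting \(b+w_1x+w_2(x-7)^2=0\) gives a quadratic in \(x\) whose two roots are the 1D boundaries. A quick numerical check using the approximate solution reported above:

    import numpy as np

    b, w1, w2 = 1.64, -0.038, -0.182   # approximate QP solution from above

    # b + w1*x + w2*(x - 7)**2 = 0  ->  w2*x**2 + (w1 - 14*w2)*x + (b + 49*w2) = 0
    coefficients = [w2, w1 - 14.0 * w2, b + 49.0 * w2]
    print(np.roots(coefficients))   # two real roots, roughly 4.1 and 9.6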

Great—we're now in good shape to analyze linearly separable data in N-dimensional space in general. Of course, we don't yet know how to select an arbitrary transformation \(\phi\) to fix overlapping data (recall that I arbitrarily chose \(\phi(x)=(x,(x-7)^2)\) to push the red and blue points well away from each other), but we're making progress.

Equivalence to convex hulls

If you're especially visually intuitive, you may have discerned that the hyperplane optimization problem in N-dimensional space is equivalent to maximizing the distance between the hyperplane and the convex hull of each class. For instance, here are the associated convex hulls and distances for our problem:


The margin maximization problem for the N-dimensional case is equivalent to identifying the midpoint between the convex hulls of the two classes. Convex hulls are shown in dashed lines according to each class color, and the shortest distance appears as the black dashed line.
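As a rough numerical check of this equivalence (a brute-force sketch of my own; for this particular data set the hulls' closest approach happens at a pair of vertices, so the minimum vertex-to-vertex distance suffices), the shortest distance between the two classes should equal the full street width \(2/||\boldsymbol{w}||\approx 10.8\):

    import numpy as np
    from itertools import product

    red  = np.array([[2.0, 25.0], [3.0, 16.0], [10.5, 12.25]])   # class -1 (transformed)
    blue = np.array([[6.0, 1.0], [6.5, 0.25], [8.3, 1.69]])      # class +1 (transformed)

    # Minimum vertex-to-vertex distance between the two classes; for this data set it
    # equals the separation between the convex hulls.
    d_min = min(np.linalg.norm(r - b) for r, b in product(red, blue))
    print(d_min)   # roughly 10.8, i.e., twice the margin of about 5.39 on each side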

This strategy clarifies how to proceed in cases where there are multiple support vectors in a single class. If one of our red triangles were moved from \((3,16)\) to \((3.5,12.25)\), for example, then it would join the red triangle at \((10.5,12.25)\) in the list of support vectors, and the maximized margin would be the distance between the hyperplane and any position on the adjacent convex hull segment:


As \(x_2\) is increased, its transformed point in \((x_1,x_2)\) space ultimately impinges on the existing margin and therefore becomes a support vector that needs to be considered during hyperplane determination. The relevant margin now extends from one support vector to a segment between two support vectors.

If we further move the \(x\) point from 3.5 to 3.8, for example, we actually end up with 4 support vectors because the changing hyperplane slope makes \(x_3\) relevant. (Try verifying this condition with the Python code linked above.)
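Here is one way such a check might look, a sketch reusing the QP setup from above rather than the linked code (the tolerance and helper name are my own):

    import numpy as np
    from qpsolvers import solve_qp

    def support_vectors(x_raw, y):
        # Solve the hard-margin QP for the phi-transformed 1D data and return the
        # indices of the points whose constraints are active (the support vectors).
        X = np.column_stack((x_raw, (x_raw - 7.0) ** 2))   # phi(x) = (x, (x - 7)**2)
        X_hat = np.hstack((np.ones((len(y), 1)), X))       # augmented inputs (1, x1, x2)
        P, q = np.diag([0.0, 1.0, 1.0]), np.zeros(3)
        G, h = -(y[:, None] * X_hat), -np.ones(len(y))
        u = solve_qp(P, q, G, h, solver="cvxopt")          # u = (b, w1, w2)
        margins = (X_hat @ u) * y                          # (w-hat . x-hat_m) y_m
        return np.where(np.isclose(margins, 1.0, atol=1e-4))[0]

    y = np.array([-1.0, -1.0, 1.0, 1.0, 1.0, -1.0])
    print(support_vectors(np.array([2.0, 3.5, 6.0, 6.5, 8.3, 10.5]), y))  # expect three support vectors
    print(support_vectors(np.array([2.0, 3.8, 6.0, 6.5, 8.3, 10.5]), y))  # expect four support vectors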

Part 1 ends here. Part 2 will address the dual formulation, the kernel trick, slack variables, cross validation, and the transition from support vector classification to support vector regression.


Copyright John M. Maloney