Support vector regression applied to Raman spectra (part 1)

Context

These notes are intended for the Simmons Mechanobiology Lab (and the group members working on the project “Measurement of Contractility-induced Residual Stress for Tissue Engineering” directed by Prof. Sarntinoranont) and their exploration of machine learning to relate Raman spectra to compliant substratum strain states.

Part 1 of these notes covers classification of linearly separable groups by maximizing the margin, transformation to achieve linear separability, Hesse normal form, the implementation of quadratic programming optimization, and equivalence to convex hulls. Part 2 covers the transition from the primal to the dual formulation, the kernel trick, and slack variables. Part 3 will address cross validation, performance metrics, the transition from support vector classification to support vector regression, and the incorporation of principal component analysis.

Our problem is as follows: If the following Raman spectra correspond to 0% and 40% uniaxial strain of polyacrylamide:

then what strain does the following spectrum correspond to?

Potentially, sophisticated learning and prediction techniques will ultimately be useful or necessary for this project. In the following content, I’ve tried to turn my notes—a beginner’s notes—into a tutorial relevant to the group. These notes also serve as an introduction to support vector machines for engineers getting up to speed in machine learning.

Strategy

My first aim is to clarify the ample jargon (e.g., what is a “maximum margin linear classifier”?) and general concepts of the field rather than to dive directly into the deep mathematical machinery of machine learning.

Many textbooks on classification, for example, immediately introduce expressions such as $$(\boldsymbol{x_1},y_1),\ldots,(\boldsymbol{x_m},y_m)\in\mathcal{X}\times\{\pm 1\},$$ which is easily interpreted by mathematicians as a compact way of saying that we have a bunch of N-dimensional input observations \(\boldsymbol{x_m}\) and corresponding output labels \(y_m\) and that each pair is assigned to either the –1 or the +1 group.

However, too many such expressions might make our eyes glaze over at first attack. Therefore, these notes focus on defining terms based on our specific interests and on presenting straightforward illustrations and intuitive reasoning to cover key concepts.

For example, we might visually represent the preceding mathematical expression as follows:


One-dimensional classification example with two groups: a couple of red triangles (corresponding to a label of -1) and three blue squares (corresponding to a label of +1). How would you separate these groups? (Plotted in Python with this code.)

Here, the individual points are each located simply by a scalar (as a simplification of the general vector \(\boldsymbol{x_m}\) that instead simply points in the \(x\) direction), and the red triangles correspond to –1 and the blue squares to +1 (representing the output labels \(y_m\)).
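For concreteness, here is a minimal sketch (separate from the linked plotting script) of how this toy data set might be stored and displayed, assuming NumPy and Matplotlib are available:

```python
import numpy as np
import matplotlib.pyplot as plt

# Toy 1D training data: inputs x_m and labels y_m (-1 = red triangle, +1 = blue square)
x = np.array([2.0, 3.0, 6.0, 6.5, 8.3])
y = np.array([-1, -1, 1, 1, 1])

fig, ax = plt.subplots(figsize=(6, 1.5))
ax.plot(x[y == -1], np.zeros((y == -1).sum()), "rv", label="label -1")
ax.plot(x[y == +1], np.zeros((y == +1).sum()), "bs", label="label +1")
ax.set_yticks([])
ax.set_xlabel("x")
ax.legend(loc="upper left")
plt.show()
```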

Specific motivation

Why would we choose to analyze Raman spectra with a support vector machine (probably preceded by principal component analysis)? I selected these tools because they’ve been found useful in multiple studies; one example is Chengxu Hu et al. “Raman spectra exploring breast tissues: Comparison of principal component analysis and support vector machine-recursive feature elimination.” Medical Physics 40.6 (2013). In addition, a quick look at the machine learning literature suggests that support vector machines are a good starting point for understanding and experimenting with artificial intelligence and that principal component analysis is well suited to reducing spectra (in which several peaks or features might change size or shape in concert) into simpler forms.

Definition of key terms

Let’s first identify what flavor of machine learning interests us.

  • We can broadly divide inference efforts between classification and regression.

    For a new data group of interest, classification (left plot below) involves predicting the appropriate associated discrete group (such as whether a certain Raman spectrum was acquired from diseased or healthy tissue); in contrast, regression (right plot below) involves predicting an appropriate associated continuous value (such as the strain state corresponding to a particular Raman spectrum).

    (Left) 2D example of classification: find the best division (gray line) between the red triangles and blue squares. (Right) 2D example of regression: find the best fit (gray line) to the red triangles.

    Since we’re interested in predicting quantitative strain values from a Raman spectrum, we’re interested in regression. Most machine learning texts address classification first, then regression; in these notes as well, I’ll start with classification and continue to refer to relevant classification scenarios when they provide a useful context to define concepts and terms. For example, parts 1 and 2 of these notes are concerned entirely with classification.

  • In addition, let’s note that we’re engaged in supervised learning: each data input (and inputs are also sometimes called instances, cases, patterns or observations) matches a data output (and outputs are sometimes called labels, targets, or—here also—observations).

    In the plots above, the input and output for the classification data are the 2D position and the classification of –1/+1, respectively; the input and output for the regression data are the \(x\) and \(y\) coordinates, respectively. In our actual problem, of course, the input is a Raman spectrum, and the output is a strain state such as the magnitude of the uniaxial stretch.

    In brief, supervised learning means that we seek a function that, given an input, returns the correct output.

    (Unsupervised learning, in contrast, doesn’t involve learning from labeled output data; one example is clustering.)

  • For both classification and regression tasks, it’s often the case that we have input-output pairs that are already known; in our case, these pairs are the applied strain states and corresponding collected Raman spectra. These data are called the training data.

    (Actually, for quality control and to optimize the system, we might take just a portion of these data as the training data, using the rest as validation data to test our learning process when we still know the correct output “answer.”)

    In any case, sometimes we’ll know the output (and we’ll use this information to refine our classification or regression process), and sometimes we won’t (and we’ll want to use our refined process to infer new conclusions). The latter case (and one of the key deliverables of this research project) involves recording Raman spectra by optically interrogating regions around contractile cells on synthetic compliant substrata or in tissue.

  • Finally, we’re performing batch learning: in the learning process, all data are supplied to the algorithm at once; the learning does not occur in stages (at least that’s not what we anticipate at this stage).

These terms help us put our desired analysis method in context, which is particularly valuable when surveying the literature through an online search: we seek a regression method that implements supervised batch learning. We anticipate providing training data (to build one or more models) and validation data (to evaluate and compare the models) in the form of input-output pairs. We expect to then apply the best trained model to obtain new inferential results from input data whose corresponding output data are unknown.
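As a rough preview of this workflow in code, here is a minimal sketch of the train/validate/predict pattern using scikit-learn’s SVR with placeholder arrays standing in for real spectra and strains (scikit-learn and the synthetic numbers are assumptions for illustration only; support vector regression itself isn’t developed until part 3):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Placeholder data standing in for real measurements: each row of X is one
# "spectrum" (here just random numbers), and y holds the known strain values.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 100))       # 50 pretend spectra, 100 wavenumber bins
y = rng.uniform(0.0, 0.4, size=50)   # pretend strains between 0 and 40%

# Split the known input-output pairs into training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

# Batch learning: all training data are handed to the algorithm at once.
model = SVR()
model.fit(X_train, y_train)

# Evaluate on held-out validation data (where we still know the "answer"),
# then predict the strain for a new spectrum whose output is unknown.
print("validation R^2:", model.score(X_val, y_val))
new_spectrum = rng.normal(size=(1, 100))
print("predicted strain:", model.predict(new_spectrum))
```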

1D classification example: Intuitive approach

To launch our machine learning adventure, let’s return to the simple classification problem displayed above. In this example, single scalar values are plotted in 1D, and the actual classes are non-overlapping (i.e., we can classify all points correctly using a single division, with no error; another term for this feature is linear separability). Here, we wish to divide the group of red triangles (lying at \(x\) coordinates 2 and 3) from the group of blue squares (lying at coordinates 6, 6.5, and 8.3):


Our 1D example classification model, where the red triangles can be perfectly separated from the blue squares by a single threshold value. Candidate divisions are A, B, and C; perhaps we intuit that B, lying at the midpoint of the smallest distance separating the two groups—the middle of the street, in effect—is the optimal choice.

Well, we could assign a division at A, B, or C with absolute success, and out of these options, B is special because it lies at the midpoint of the gap and therefore maximizes the margin on either side. That is, if future data included points lying closer to the opposing group, then a centered division point would probably be best situated to accommodate this variation:


Gray circles behind each input point represent the maximum potential error that can be accommodated by a certain division. Division A is less accommodating (and therefore more likely to misclassify additional data) than division B; equivalently, we say that B is better at maximizing the margins.

The result of this intuitive strategy is called the maximum margin linear classifier, just to give an example of the terms that seem painfully opaque until we develop them in easy-to-understand stages.

1D classification example: Developing a systematic approach

For our 1D data, let’s call the specific value of the selected division point (equivalently, the threshold or boundary condition) \(a\). We intuitively chose a threshold of \(a=4.5\) (i.e., line B, which lies midway between points at 3 and 6), representing the midpoint between the farthest-right red triangle and the farthest-left blue square, which are circled below.

We arrive at an important point: the position of the maximum margin linear classifier depends only on those circled points (vectors in the N-dimensional case). (The positions of the other points/vectors aren’t important except inasmuch as they are correctly classified.) Because these key vectors provide the foundation for the classification procedure—they support classification—they are termed support vectors:


When we divide non-overlapping classes, only the inputs nearest the division point are relevant and are designated support vectors because of their importance. Other inputs—as long as they are correctly classified—simply don’t matter.

Wow, we can now consider ourselves to be performing support vector classification with very little conceptual burden. The key point to remember is that the support vectors predominate in determining the position of the separating classifier.

Now, one way to classify a new data point \(x\) would be to check whether it’s greater than or less than \(a\): do we have \(x\lt a\), or do we have \(x\gt a\)?

But an alternate approach, well suited for classifications comprising –1 and +1, would be to simply calculate the sign of \(x-a\), or \(\mathrm{sgn}(x-a)\). Note that this approach gives the classification (of –1 or +1) automatically, which is convenient. In fact, the sign of any positive multiple of \((x-a)\) would work, corresponding to different lines (hyperplanes in the N-dimensional case) passing through the division point. We could write the hyperplane equation as \(y=w(x-a)\), for example, for positive \(w\):


The sign function facilitates automatic classification of input values. Shown in gray are three candidate hyperplanes that all correctly classify the two groups once the sign function is applied; the dashed black line shows their common predictive output.
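In code, this decision rule amounts to a one-line function; here is a minimal sketch assuming NumPy:

```python
import numpy as np

def classify(x, a, w=1.0):
    """Return -1 or +1 for input(s) x given threshold a and any positive multiple w."""
    return np.sign(w * (np.asarray(x, dtype=float) - a))

# With the intuitive threshold a = 4.5, all five training points are labeled correctly:
print(classify([2, 3, 6, 6.5, 8.3], a=4.5))   # [-1. -1.  1.  1.  1.]
```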

Notably, only one hyperplane passes through the support vectors, and this special hyperplane is called the canonical or optimal hyperplane:


The canonical hyperplane (black line) satisfies the correct classification result of –1 and +1 exactly at the support vectors. We seek a way to calculate the details of this hyperplane automatically.

Up to this point, we’ve used intuition to maximize the margin between groups. Let’s now try working backward: taking a clue from the canonical hyperplane shown above, let’s require that our separator satisfies \(w(x_m-a)\le -1\) for every red triangle (classification \(y_m=-1\)) and \(w(x_m-a)\ge 1\) for every blue square (classification \(y_m=1\)). Hey, wait a second, it’s more efficient to write simply \(w(x_m-a)y_m\ge 1\), which applies to all points. So that’s clever.

(Here, \(x_m\) and \(y_m\) refer to the specific data points; numbering them from left to right, we’d designate the red triangles as \(x_1\) and \(x_2\) and the blue squares as \(x_3\), \(x_4\), and \(x_5\), so that the input data are \((x_m,y_m)=((2,-1),(3,-1),(6,1),(6.5,1),(8.3,1))\). Clear?)

Check out which hyperplanes satisfy this requirement for divisions A, B, and C:

    
Hyperplanes for divisions A, B, and C. A systematic way to obtain the canonical hyperplane (shown in black) is to ask which hyperplane exhibits the minimum slope while satisfying the necessary classification criterion \(w(x_m-a)y_m\ge 1\). (Plotted in Python with this code.)

Expressed in words, we’re requiring not only that the sign of the hyperplane predicts the correct classification but also that the hyperplane never undercuts the support vectors. For example, the hyperplane corresponding to division A is constrained to pass through the far-right red triangle, and the hyperplane corresponding to division C is constrained to pass through the far-left blue square. Note that the slope of the hyperplane is lower when it passes through the optimal division B than when it passes through the inferior divisions at A and C; furthermore, note that the B hyperplane passes through both support vectors.

So we’ve found (in 1D for the linearly separable case, at least) that minimizing \(w\) while satisfying \(w(x_m-a)y_m\ge 1\) works as a rigorous way to find the maximum margin linear classifier. It turns out that this aspect still holds when we extend the situation to 2D and higher: we’ll still seek to minimize the effective slope \(w\) of the hyperplane while never undercutting a support vector.

Let’s now consider the complication of input points unavoidably lying on the “wrong side” of the division or threshold, which will force us to extend the 1D case to higher dimensions.

Accommodating overlapping classes through transformation

We’ve seen that non-overlapping classes (equivalently, linearly separable classes) in 1D are relatively straightforward to work with because the general position of the divider is intuitive; perfect classification is assured. But what if the classes are overlapping?

For overlapping classes, we have at least two approaches at our disposal: we can transform the data to be non-overlapping, or we can accommodate the misclassification through so-called slack variables (to be covered in part 2).

As an example of transformation, consider the following input data, which extends our existing data set by adding another red triangle on the right at \(x=10.5\):


An additional data point on the “wrong side” removes the feature of linear separability, forcing us to look for other solutions.

The presence of this new data point in the “–1” class precludes error-free classification from a simple single threshold. Notably, however, we could plot the data in a higher-dimensional space in which the \(x\)-axis still plots \(x\) and a second axis plots, say, \((x-7)^2\). Let’s call this transformation \(\phi\); whereas we once plotted \(x_m\) on a 1D \(x\)-axis, we now plot the vectors \(\boldsymbol{x_m}=\phi(x_m)=\left(x_m,(x_m-7)^2\right)\) on a 2D plane:


To separate the –1 and +1 classes, we could transform our original scalar \(x\) data to the corresponding \(\phi(x)\) data that are actually separable. This process is as simple as adding a second axis that plots \((x-7)^2\), which is chosen simply by eye in this case. Gray lines represent potential hyperplanes, or decision criteria, that correctly classify the now linearly separable data. The choice of hyperplane affects the classification of a potential future data point ×, so we really should decide what it means to optimize this choice and how to do so.

Linear separability returns! (At the cost of some additional mathematical machinery.)

This technique is remarkable, no? It suggests that with a suitably customized transformation, we could accommodate any additional data points that would hinder making a decision in the original space. (For the 1D case at least, this process is as simple as applying some type of polynomial transformation and plotting the results on a second axis. For the N-dimensional case, we could imagine using some sufficiently sophisticated function.)
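As a concrete sketch, here is this particular transformation applied to the extended data set (NumPy assumed; recall that the \((x-7)^2\) term was chosen by eye):

```python
import numpy as np

x = np.array([2.0, 3.0, 6.0, 6.5, 8.3, 10.5])
y = np.array([-1, -1, 1, 1, 1, -1])

def phi(x):
    """Map each scalar x to the 2D point (x, (x - 7)^2) so that the classes separate."""
    x = np.asarray(x, dtype=float)
    return np.column_stack((x, (x - 7.0) ** 2))

X = phi(x)
print(X)   # rows: (2, 25), (3, 16), (6, 1), (6.5, 0.25), (8.3, 1.69), (10.5, 12.25)
```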

The process of specifying such a transformation has been likened to having the data points distributed across a bedsheet and flicking the sheet in such a way that the different classes are now successfully separated. And one could imagine that a sufficiently complex flicking process could separate any collection of intermixed data. (But is our transformation needlessly complex to accommodate the details of the training data, without necessarily fitting the subsequent data successfully? This will be a topic of interest later when we discuss overfitting.)

Of course, it’s true we don’t know yet what slope and position we should use for the optimal decision line or hyperplane (except inasmuch as we can expect that we should maximize the margin, as before). And the choice of hyperplane has consequences; our decision would govern how we classify the new point shown as × above, for example. Here are two ways to visualize the margin provided by various slopes and positions of sample hyperplanes:

Classifier margin for various hyperplane slopes and positions. Left: minimum margin shown as a circular region of safety for the point most susceptible to misclassification. Right: minimum margin shown as a bound on either side of the hyperplane, where the gray strip can be thought of as a street between the classes. Larger margins—larger streets—are always preferred.

In 2D or higher, we could start by selecting the midpoint between the two closest points (which worked well in the 1D case), but it’s not necessarily clear yet how to set the slope to avoid passing close to—or worse, misclassifying—other points. So we’ll need some mathematical guidance for choosing the best hyperplane parameters, and that guidance will involve extending our optimization rules. This aspect is discussed in the next section.

Formulating the classification problem in 2D and higher

Let’s take a moment to summarize. I hope that both the intuitive approach (maximize the margin) and the corresponding optimization algorithm (minimize \(w\) while satisfying \(w(x_m-a)y_m\ge 1\)) appear straightforward for non-overlapping scalar groups that can be plotted on a line. To generalize this approach to (potentially overlapping) groups of vectors in N-dimensional space, we first need to review some geometrical concepts.

Geometrical considerations, including the introduction of Hesse form

This discussion of how a support vector machine operates is necessarily going to include some geometry-intensive considerations; for example, we’ll need to address the distance between a point and a hyperplane in N-dimensional space.

We’re likely familiar with expressing a line in the form of \(y=mx+c\) (or whatever parameters we’re familiar with; note that the slope \(m\) is distinct from the index (\(_m\)) that we’ve been using to number the inputs). Here, \(x\) and \(y\) are plotted on the horizontal and vertical axes, respectively; \(m\) is the slope, or rise over run; and \(c\) is the \(y\) intercept. In the standard plot, the \(x\)-axis presents the independent variable (i.e., the one we can change), and the \(y\)-axis presents the dependent variable (i.e., the one we wish to understand). In materials science, for example, we might plot an alloy’s yield strength (\(y\)-axis) as a function of its temperature (\(x\)-axis). Or the stiffness of a hydrogel as a function of the surrounding salinity.

Consider our classification problem with points in 2D space and a line that we wish to choose to maximize the margin. Each set of points (which can also be thought of as vectors extending from the origin) now represents a set of input data to be classified in terms of a certain output. For example, we might wish to consider the temperature and pressure before classifying a solid-state phase as thermodynamically stable or not.

With the 2D points that arise in a classification problem, we don’t really have dependent and independent axes; in our thermodynamic stability example, for instance, we should be able to swap the temperature and pressure axes without affecting the results. So instead of \(x\) and \(y\), let’s use \(x_1\) and \(x_2\). And let’s bring the variables over to one side to obtain the line equation \(mx_1-x_2+b=0\). This form is more symmetric and therefore more appropriate for our N-dimensional inputs that all have equal importance a priori.

Notably, we can rewrite this equation \(mx_1-x_2+b=0\) as \(\boldsymbol{w}\cdot\boldsymbol{x}+b=0\), where \(\boldsymbol{w}=(m,-1)\) and where \(\boldsymbol{x}\) is the vector pointing from the origin to a certain point; here, we’ve employed the dot product \(\boldsymbol{w}\cdot\boldsymbol{x}=w_1x_1+w_2x_2\). This form is known as Hesse form.

Equivalently, we could write \(\boldsymbol{w}\cdot\boldsymbol{x}=\boldsymbol{w}^\mathsf{T}\boldsymbol{x}\), where all vectors are expressed in column form and \(\mathsf{T}\) represents the matrix transpose; here, we’re performing matrix multiplication with vectors.

(Still another equivalent representation is \(\langle\boldsymbol{w},\boldsymbol{x}\rangle\), where \(\langle\cdot{,}\cdot\rangle\) is called the inner product or scalar product, but we’ll stick to the dot notation and the transpose notation. In part 2, another equivalent representation appears: \(\sum_m w_m x_m\), termed index notation.)

The reverse is also possible: given an equation in Hesse form \(\boldsymbol{w}\cdot\boldsymbol{x}+b=0\), or (in 2D) \(w_1x_1+w_2x_2+b=0\), we can easily determine that \(x_2=-\left(\frac{w_1}{w_2}\right)x_1-\frac{b}{w_2}\), which if plotted on an \(x_1\) vs. \(x_2\) Cartesian space corresponds to a line with slope \(m=-\frac{w_1}{w_2}\) and \(y\) intercept \(c=-\frac{b}{w_2}\). OK.
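Here is a tiny sketch of this back-and-forth conversion, which comes in handy when plotting candidate hyperplanes (NumPy assumed):

```python
import numpy as np

def hesse_to_slope_intercept(w, b):
    """Convert the 2D Hesse form w . x + b = 0 to the slope m and intercept c of x2 = m*x1 + c."""
    w1, w2 = w
    return -w1 / w2, -b / w2

def slope_intercept_to_hesse(m, c):
    """Convert x2 = m*x1 + c to one valid Hesse-form pair: w = (m, -1), b = c."""
    return np.array([m, -1.0]), c

# Example with arbitrary values: slope 2 and intercept -3, converted and recovered.
w, b = slope_intercept_to_hesse(2.0, -3.0)
print(w, b)                             # w = (2, -1), b = -3
print(hesse_to_slope_intercept(w, b))   # recovers slope 2 and intercept -3
```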

Now, I should say at this point that I find this shift to Hesse form very disconcerting regardless of how casually it’s applied in support vector machine texts. Most texts and tutorials, even at the beginner level, simply state something like "We seek a hyperplane equation with input weights \(\boldsymbol{w}\) [these components are often called weights] and bias \(b\) that satisfies…" It’s easy to get stuck on such framing because the Hesse form involves a locus of points composing the hyperplane. Here, we’re no longer calculating a certain \(y\) from a given \(x\) (which is straightforward) but rather seeking some amorphous group of points that satisfies the condition of the left-hand side equaling zero. So why is it that we use this formulation? Here are some of the key reasons:

  • With all variables brought to one side, we attain symmetry; no one type of input is privileged over another.

  • In addition, the dot product lets us apply all the machinery of linear algebra to an arbitrary number of N-dimensional inputs, expressed as \(\boldsymbol{x_m}\), each with a corresponding classification \(y_m\). If you've ever taken a class on linear algebra, you might recall how some of this machinery provides the ability to solve many equations simultaneously.

  • Another neat aspect is that \(\boldsymbol{w}\) is normal (i.e., perpendicular) to our hyperplane. This condition is useful because we’ll be relying strongly on perpendicular distances when maximizing margins.

  • Algebraic expressions are readily available for calculating distances in terms of normal vectors to a plane. For example, it’s useful to know that the distance \(d\) between a point \(\boldsymbol{x_m}\) and a hyperplane \(\boldsymbol{w}\cdot\boldsymbol{x}+b=0\) is \(d=\frac{|\boldsymbol{w}\cdot\boldsymbol{x_m}+b|}{||\boldsymbol{w}||}\), where \(||\boldsymbol{w}||=\sqrt{\boldsymbol{w}\cdot\boldsymbol{w}}=\sqrt{\boldsymbol{w}^\mathsf{T}\boldsymbol{w}}\) is the length—also called the norm—of the vector \(\boldsymbol{w}\).

  • Similar to the 1D case where we sought to satisfy \(w(x_m-a)y_m\ge 1\), we now seek to satisfy (by analogy) \((\boldsymbol{w}\cdot\boldsymbol{x_m}+b)y_m\ge 1\) (so \(w\) and \(a\) get slightly shuffled into \(\boldsymbol{w}\) and \(b\), but the general idea stays the same, with the key difference being that the scalar \(w\) changes to the vector \(\boldsymbol{w}\)). We might compress this expression further by defining the augmented hyperplane and input vectors \(\boldsymbol{\hat{w}}=(b,w_1,w_2,\dots,w_N)\) and \(\boldsymbol{\hat{x}_m}=(1,x_1,x_2,\dots,x_N)\), respectively, prepending \(b\) to \(\boldsymbol{\hat{w}}\) and \(1\) to \(\boldsymbol{\hat{x}_m}\) to give \((\boldsymbol{\hat{w}}\cdot\boldsymbol{\hat{x}_m})y_m\ge 1\) as a compact criterion for correct classification. Try implementing this dot product to confirm that the results are the same as before (a short code sketch of this check, along with the distance formula above, appears after this list).

These reasons provide ample motivation for getting used to Hesse form, dot products, the matrix transpose, and the other machinery of linear algebra!
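As promised above, here is a minimal NumPy sketch of the point-to-hyperplane distance formula and of the augmented-vector form of the classification criterion (the example hyperplane parameters are made up for illustration):

```python
import numpy as np

def distance_to_hyperplane(x_m, w, b):
    """Distance d = |w . x_m + b| / ||w|| from the point x_m to the hyperplane w . x + b = 0."""
    w = np.asarray(w, dtype=float)
    return abs(np.dot(w, x_m) + b) / np.linalg.norm(w)

def satisfies_criterion(x_m, y_m, w, b):
    """Check the classification criterion (w_hat . x_hat_m) y_m >= 1 in augmented form."""
    w_hat = np.concatenate(([b], w))       # prepend the bias b to the weights
    x_hat = np.concatenate(([1.0], x_m))   # prepend 1 to the input vector
    return np.dot(w_hat, x_hat) * y_m >= 1

# Example with made-up (not yet optimal) hyperplane parameters:
w, b = np.array([-0.05, -0.2]), 2.0
x_2 = np.array([3.0, 16.0])   # the second transformed input point, with label y = -1
print(distance_to_hyperplane(x_2, w, b))
print(satisfies_criterion(x_2, -1, w, b))
```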

We now generalize to the 2D case and beyond. As described for the 1D case and in the box above, we require the hyperplane to not undercut any support vectors; therefore, we require \((\boldsymbol{w}\cdot\boldsymbol{x_m}+b)y_m\ge 1\). In addition, we seek to maximize the margin. In the 1D case, this aspect was as simple as minimizing the slope \(w\) of the line that acted as the hyperplane. What about the case of higher dimensions? Well, we already identified a formula for the distance from a point to a hyperplane in the box above, so let’s maximize the distances from the hyperplane \(\boldsymbol{w}\cdot\boldsymbol{x_m}+b=0\) to the points to be classified.

Using the distance formula introduced in the box above, we can say that we wish to maximize

$$d=\frac{|\boldsymbol{w}\cdot\boldsymbol{x_m}+b|}{||\boldsymbol{w}||}.$$

Shown below is the distance from the arbitrary input point \(\boldsymbol{x_2}\) to a candidate hyperplane, for example (the \(\phi\) notation is omitted for clarity from here on, but don’t forget that the original \(x_m\) points were \(\phi\)-transformed to achieve linear separability):


Distance between an example point (here, the second point in the –1 class, or \(\boldsymbol{x_2}\)) and a candidate hyperplane (shown in gray and characterized by its normal vector \(\boldsymbol{w}\)). Caution: when the axes are scaled differently (equivalently, when the plotting aspect ratio does not equal one), orthogonal vectors don’t appear perpendicular; nevertheless, both \(\boldsymbol{w}\) and the distance measurement are definitely orthogonal to the hyperplane.


The corresponding plot with an aspect ratio of one, now with orthogonal vectors appearing perpendicular.

More precisely, we wish to maximize the minimum distance across the groups of data points for each classification. In this case, we see that \(d\) for \(\boldsymbol{x_2}\) is larger than that for \(\boldsymbol{x_6}\); therefore, point \(\boldsymbol{x_6}\) is more important for this particular hyperplane slope and intercept. Of course, for a different slope and intercept, point \(\boldsymbol{x_2}\) might predominate. It’s this type of iterative decision making that we seek to automate and delegate to the computer to implement accurately and consistently.

(A side note: At the risk of causing confusion, the former figure continues the trend of scaling the \(y\)-axis to suitably fit within the available space; as a result, orthogonal entities such as the hyperplane and its normal vector \(\boldsymbol{w}\) don’t appear perpendicular. This weirdness tripped me up for several hours as I worked on these figures until I remembered that scaling an axis keeps parallel lines parallel but does not necessarily keep perpendicular lines perpendicular. I’m sticking with this presentation to emphasize this aspect to the reader—after all, we’ll frequently encounter or employ plots with a plotting aspect ratio other than one, and we shouldn’t be stymied by this result. The weight vector \(\boldsymbol{w}\) is, as always, orthogonal to the hyperplane and parallel to the distance over which \(d\) is measured. See the latter figure, which plots the axes with equal scaling, in case this aspect is confusing.)

Back to the distance calculation. We already know that at the support vectors, the classification inequality turns into the equality \(|\boldsymbol{w}\cdot\boldsymbol{x_m}+b|=1\), as in the 1D case where the optimal (black) line predicted exactly –1 or 1 at the support vectors. Therefore, the numerator simplifies to 1 at the support vectors (which have yet to be determined rigorously), and we seek simply to maximize \(\frac{2}{||\boldsymbol{w}||}\) (the factor of 2 arises because the margin is \(\frac{1}{||\boldsymbol{w}||}\) on either side of the hyperplane) or equivalently to minimize \(\frac{1}{2}||\boldsymbol{w}||=\frac{1}{2}\sqrt{\boldsymbol{w}^\mathsf{T}\boldsymbol{w}}\) or half the length of the vector \(\boldsymbol{w}\), which is very similar to minimizing the scalar \(w\) in the 1D case. For convenience, we might instead choose to minimize \(\frac{1}{2}||\boldsymbol{w}||^2=\frac{1}{2}\boldsymbol{w}^\mathsf{T}\boldsymbol{w}\)—squaring the length won’t have any effect in a minimization problem and conveniently avoids the square root. The outcome of this constrained optimization, as we hoped to achieve, is the canonical hyperplane with maximum margin.

Reiterating, our problem can be stated as follows:

$$\mathrm{minimize~}\frac{1}{2}||\boldsymbol{w}||^2=\frac{1}{2}\boldsymbol{w}^\mathsf{T}\boldsymbol{w},$$

$$\mathrm{subject~to~}(\boldsymbol{w}\cdot\boldsymbol{x_m}+b)y_m\ge 1\mathrm{~(or,~equivalently,~}(\boldsymbol{\hat{w}}\cdot\boldsymbol{\hat{x}_m})y_m\ge 1\mathrm{)}.$$

This is a so-called quadratic programming optimization problem, for which several open-source solvers are available. (Because of the quadratic nature, we advantageously avoid the possibility of becoming trapped in non-global minima. The operational details of such solvers are outside the scope of these notes.) For example, the qpsolvers package in Python, implemented as linked below, can accommodate this problem. The package solves problems of the following form:

$$\mathrm{minimize~}\frac{1}{2}\boldsymbol{u}^\mathsf{T}\boldsymbol{Pu}+\boldsymbol{q}^\mathsf{T}\boldsymbol{u},$$

$$\mathrm{subject~to~}\boldsymbol{Gu}\le\boldsymbol{h}.$$

Careful inspection reveals that if we take \(\boldsymbol{u}\) as \(\boldsymbol{\hat{w}}\), then \(\boldsymbol{P}\) should be the identity matrix with the (1,1) element set to zero (to exclude \(b\) from the quantity being minimized), that \(\boldsymbol{q}=\boldsymbol{0}\), that the signs of \(\boldsymbol{G}\) and \(\boldsymbol{h}\) need to be flipped because the solver’s inequality points in the opposite direction from ours, and that we must carefully assemble \(\boldsymbol{G}\) from our knowledge of \(\boldsymbol{x_m}\)—where the input data were transformed in this case, so \(\boldsymbol{x_m}=\phi(x_m)=(x_m,(x_m-7)^2)\)—and \(y_m\). That’s all there is to it!

Recall that our input data are

$$(x_m,y_m)=((2,-1),(3,-1),(6,1),(6.5,1),(8.3,1),(10.5,-1)),$$

which we transformed to

$$(\boldsymbol{x_m},y_m)=(((2,25),-1),((3,16),-1),((6,1),1),\\((6.5,0.25),1),((8.3,1.69),1),((10.5,12.25),-1))$$

to achieve linear separability. In equation form, we therefore have

$$\boldsymbol{\hat{w}}=\begin{bmatrix}b \\ w_1\\w_2\end{bmatrix},$$

$$\boldsymbol{P}=\begin{bmatrix}0 & 0 & 0\\0 & 1 & 0\\0 & 0 & 1\end{bmatrix},$$

$$\boldsymbol{q}=\begin{bmatrix}0 \\0 \\0 \end{bmatrix},$$

$$\boldsymbol{G}=-\begin{bmatrix}-1 & -2 & -25 \\ -1 & -3 & -16 \\1 & 6 & 1 \\1 & 6.5 & 0.25\\ 1 & 8.3 & 1.69 \\ -1 & -10.5 & -12.25\end{bmatrix},$$

$$\boldsymbol{h}=-\begin{bmatrix}1 \\1 \\1\\1\\1\\1 \end{bmatrix}.$$

Take care to understand how matrix \(\boldsymbol{G}\) is constructed to satisfy the inequality constraint.

Implementing this code in Python, we obtain a solution of \(\boldsymbol{\hat{w}}=[b,\,w_1,\,w_2]=[1.64,\,-0.038,\,-0.182]\), corresponding to the following hyperplane:


Optimal hyperplane for our transformed data as calculated via quadratic programming optimization. The margin on either side of the line is \(1/||\boldsymbol{w}||\) (here, 5.39); no larger total value is possible. The \(x\) intercept lies at \(-b/w_1\), and the \(y\) intercept lies at \(-b/w_2\). The support vectors are circled; they’re the points that exactly satisfy \((\boldsymbol{\hat{w}}\cdot\boldsymbol{\hat{x}_m})y_m=1\). The dashed line and normal vector \(\boldsymbol{w}\) are both orthogonal to the hyperplane (but not perpendicular in this plot because of axis scaling).
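For reference, here is a minimal sketch (separate from the linked implementation) of how this problem might be posed with the qpsolvers package, assuming it and a backend solver such as cvxopt are installed; it should recover approximately the same \(\boldsymbol{\hat{w}}\) and, as discussed next, flag the support vectors:

```python
import numpy as np
from qpsolvers import solve_qp

# Transformed inputs phi(x_m) = (x_m, (x_m - 7)^2) and labels y_m
X = np.array([[2.0, 25.0], [3.0, 16.0], [6.0, 1.0],
              [6.5, 0.25], [8.3, 1.69], [10.5, 12.25]])
y = np.array([-1.0, -1.0, 1.0, 1.0, 1.0, -1.0])

# Augmented inputs x_hat_m = (1, x_m1, x_m2)
X_hat = np.column_stack((np.ones(len(y)), X))

# Pieces of the quadratic program: minimize (1/2) u^T P u + q^T u subject to G u <= h,
# with u = w_hat = (b, w1, w2).
P = np.diag([0.0, 1.0, 1.0])      # identity with the b entry zeroed out
q = np.zeros(3)
G = -(y[:, None] * X_hat)         # signs flipped so that (w_hat . x_hat_m) y_m >= 1
h = -np.ones(len(y))

# If the chosen backend objects to the positive-semidefinite P, adding a tiny
# value (e.g., 1e-9) to P[0, 0] is a common workaround.
w_hat = solve_qp(P, q, G, h, solver="cvxopt")
print("w_hat =", w_hat)           # expect roughly [1.64, -0.038, -0.182]

# Support vectors are the points where the constraint is met with equality:
print("(w_hat . x_hat_m) y_m =", np.round((X_hat @ w_hat) * y, 3))   # x_5 and x_6 give 1
```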

Calculating \((\boldsymbol{\hat{w}}\cdot\boldsymbol{\hat{x}_m})y_m\) for each vector \(\boldsymbol{x_m}\), we find that \(\boldsymbol{x_5}\) and \(\boldsymbol{x_6}\) give outputs of exactly 1, therefore identifying them as the support vectors, shown encircled in the following animation:


Animated view of the optimal hyperplane, which meets the support vectors at the classification results of \(y=-1\) and \(y=1\).

If we were to undo our transformation from 2D back to 1D, we’d find that the hyperplane reduces to two division points that perfectly classify our input data:


A single linear separator in 2D reduces in this case to two boundaries that correctly classify the overlapping 1D data.
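To locate those two division points numerically, we can set the hyperplane expression \(w_1x+w_2(x-7)^2+b\) to zero and solve the resulting quadratic in the original \(x\); here is a short sketch using the rounded \(\boldsymbol{\hat{w}}\) reported above (so the roots are only approximate):

```python
import numpy as np

b, w1, w2 = 1.64, -0.038, -0.182   # rounded solution from the quadratic program

# w1*x + w2*(x - 7)^2 + b = 0  expands to  w2*x^2 + (w1 - 14*w2)*x + (b + 49*w2) = 0
coeffs = [w2, w1 - 14.0 * w2, b + 49.0 * w2]
print(np.roots(coeffs))   # roughly 4.1 and 9.6 (order may vary): the two 1D division points
```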

Great—we’re now in good shape to analyze linearly separable data in N-dimensional space in general. Of course, we don’t yet know how to design a particular transformation \(\phi\) to best accommodate overlapping data. For example, recall that I arbitrarily chose \(\phi(x)=(x,(x-7)^2)\) to push the red and blue points well away from each other; however, the classifier that maximized the margins in two dimensions fails to evenly split the spaces between the classes in the 1D representation shown above. Nevertheless, we’re making progress.

To provide additional insight, here's another way of looking at the problem:

Equivalence to convex hulls

If you’re especially visually intuitive, you may have discerned that the hyperplane optimization problem in N-dimensional space is equivalent to maximizing the distance between the hyperplane and the convex hull of each class. For instance, here are the associated convex hulls and distances for our problem:


The margin maximization problem for the N-dimensional case is equivalent to identifying the midpoint of the shortest path between the convex hulls of the two classes. Convex hulls are shown as dashed lines in each class’s color, and the shortest distance between them appears as the black dashed line.
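Here is a minimal sketch of this view, assuming the shapely package is available (any convex hull routine would do): build the hull of each class and ask for the shortest distance between the hulls, which should come out near twice the margin, i.e., about \(2/||\boldsymbol{w}||\approx 10.8\) here:

```python
import numpy as np
from shapely.geometry import MultiPoint

# Transformed inputs and labels, as before
X = np.array([[2.0, 25.0], [3.0, 16.0], [6.0, 1.0],
              [6.5, 0.25], [8.3, 1.69], [10.5, 12.25]])
y = np.array([-1, -1, 1, 1, 1, -1])

# Convex hull of each class (each is a triangle here).
hull_neg = MultiPoint([tuple(p) for p in X[y == -1]]).convex_hull
hull_pos = MultiPoint([tuple(p) for p in X[y == +1]]).convex_hull

# Shortest distance between the two hulls; the optimal hyperplane bisects this gap.
print(hull_neg.distance(hull_pos))   # expect roughly 2/||w|| (about 10.8)
```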

This strategy clarifies how to proceed in cases where there are multiple support vectors in a single class. If one of our red triangles were moved from \((3,16)\) to \((3.5,12.25)\), for example, then it would join the red triangle at \((10.5,12.25)\) in the list of support vectors, and the maximized margin would be the distance between the hyperplane and any position on the adjacent convex hull segment:


As the second input point \(x_2\) is increased, its transformed point in the \((x_1,x_2)\) plane ultimately impinges on the existing margin and therefore becomes a support vector that needs to be considered during hyperplane determination. The relevant margin now extends from one support vector to a segment between two support vectors.

If we further move the \(x_2\) point from 3.5 to 3.8, for example, we actually end up with four support vectors because the changing hyperplane slope makes \(x_3\) relevant. (Try verifying this condition using the Python code linked above.) Consider the variety of support vector groups and margin directions and distances that arise as \(x_2\) moves from 3 to 5.5 and back, now plotted with equal axis scaling to emphasize certain parallel and perpendicular relationships:


Part 1 ends here. Part 2 addresses the transition from the primal to the dual formulation, the kernel trick, and slack variables. Part 3 will address cross validation, evaluation metrics, the transition from support vector classification to support vector regression, and the incorporation of principal component analysis.


© Copyright John M. Maloney