John M. Maloney - Support vector regression applied to Raman spectra (part 2)

To learn how to design our transformation function $\phi$ (or, as it turns out, to sidestep this issue!), it's necessary to transition from the so-called primal form of the optimization problem to the so-called dual form. The primal form was derived in part 1: $$\mathrm{minimize~}\frac{1}{2}||\boldsymbol{w}||^2;$$ $$\mathrm{subject~to~}(\boldsymbol{w}\cdot\boldsymbol{x_m}+b)y_m\ge 1.$$

That is, we seek to maximize the margin (the first expression) while never undercutting a support vector (the second expression). Earlier, we found that maximizing the margin is equivalent to minimizing $\frac{1}{2}||\boldsymbol{w}||$, and then we squared $||\boldsymbol{w}||$ for convenience to make it more easily expressible as $||\boldsymbol{w}||^2=\boldsymbol{w}^\mathrm{T}\boldsymbol{w}=\boldsymbol{w}\cdot\boldsymbol{w}$. Remember that in our case, $\boldsymbol{x_m}=\phi(x_m)$, where $x_m$ is simply a list of numbers $(2,3,6,6.5,8.3,10.5)$ with associated classifications $y_m=(-1,-1,1,1,1,-1)$. The vector $\boldsymbol{w}$ and the number $b$, which are termed the weights and bias, respectively, describe the hyperplane that suitably classifies our data.

Continued progress is necessarily going to grow relatively math intensive. Fortunately, the first aspect to explore, Lagrange multipliers, is part of the undergrad engineering and science curriculum and can be understood intuitively through graphical examples.

Briefly, Lagrange multipliers are useful when we wish to maximize one function ($f$, say) while satisfying a constraint involving another function ($g=g_0$, say, where $g$ is a function and $g_0$ is a constant). Let's immediately show a visual example for clarity:

We seek the largest value of $f$ (shown in blue, with contour loops that get darker with increasing $f$) with the constraint that we must stay on the orange planar curve corresponding to the constraint $g=g_0$. What point would you select? Does a systematic approach exist? How would you automate this approach? (Produced in Python with this code.)

To obtain a sense of what it means to maximize a function subject to a constraint, assume that our function $f$ takes as input a 2D location on the $x$ and $y$ axes and returns a number on the $z$ axis, analogous to an elevation or altitude. We pursue this analogy by plotting the contour map of $f$ above (blue contours), similar to a trail or topographic map, where the output altitude increases in this case as we move to inner contours. Simultaneously, the constraint $g=g_0$ (the orange planar curve) governs where we are allowed to move. If it weren't for this constraint, of course we'd just pick the peak of $f$'s blue mountain; the constraint makes the solution more elusive.

Let's identify a few important aspects of this plot. One is that the gradient of $f$, or $\nabla f$ (essentially the multidimensional vector version of a derivative, defined in this context as $\nabla=\boldsymbol{\hat{x}}\frac{\partial}{\partial x}+\boldsymbol{\hat{y}}\frac{\partial}{\partial y}$, where $\hat{x}$ points in the $x$ direction and $\hat{y}$ in the $y$ direction) always points toward increasing $f$ and also always points perpendicular to the $f$ contour lines, as shown by the vector arrows below, selected for two particular contours of $f$:

Arrows show the direction of the gradient, which always points perpendicular to contour lines and functions. Blue arrows show the directions of the greatest increase of $f$ for two contours, and orange arrows show the gradient directions for various locations on the constraint.

The gradient as the direction of maximum increase

We seek to show that the gradient of a function (e.g., at a contour curve) always points in the direction of the maximum increase of that function.

Taylor series expansions are generally awesome because they let us approximate $f(r+\delta r)$ for small $\delta r$ as

$$f(r+\delta r)=f(r)+f^\prime(r)\delta r+\frac{1}{2!}f^{\prime\prime}(r)(\delta r)^2+\cdots\\+\frac{1}{n!}f^{(n)}(r)(\delta r)^n+\cdots,$$

where the prime symbol indicates the derivative with respect to $r$ (and the superscript $(n)$ indicates the nth derivative). The vector version, relevant here, is

$$f({\boldsymbol{r}}+{\delta\boldsymbol{ r}})\approx f({\bf r})+(\nabla f)\cdot{\delta\boldsymbol{r}},$$

where the gradient replaces the derivative and the dot product replaces multiplication and where we've dropped all additional terms as insignificant. Since the dot product effectively measures whether two vectors point in similar directions, the last term is maximized when $\delta\boldsymbol{r}$ points in the same direction as $\nabla f$, thus maximizing $f({\boldsymbol{r}}+{\delta\boldsymbol{ r}})$, which is equivalent to saying that the direction of the maximum increase of $f$ is that of $\nabla f$.

Regarding the plot above, note that the gradient of the constraint function $g$, namely, $\nabla g$, is also perpendicular to the path but in this case shows where we can't go—we're constrained to stay on that path.

Now, there's a contour of $f$ that tangentially intersects the $g=g_0$ constraint, and it's not a coincidence that the gradients of the two curves are parallel at that intersection, which of course is the point we seek—the point that maximizes $f$ while satisfying $g=g_0$:

Arrows show the directions of the gradients of $f$ (blue arrows) and $g$ (orange arrows) at various points on the constraint $g=g_0$ (orange curve). The point we seek—which lies on the special contour in dark blue—is the one on the constraint (orange line) where the gradient of the constraint is parallel to the gradient of $f$. It's at this point that $f$ is maximized while satisfying $g=g_0$. This approach provides a systematic way to achieve our objective.

In fact, you can't draw a smooth tangential intersection that doesn't satisfy this property. The mathematical way of expressing that these two gradients $\nabla f$ and $\nabla g$ are parallel is

where $\alpha$ (termed a Lagrange multiplier in this context) is just a scaling constant. In general, $\alpha$ can be positive or negative. (It would be negative for the condition shown above because at the optimum point, the blue and orange arrows point in opposite directions.)

Here's another way to think of the problem: Since $\nabla f$ points in the direction of the greatest change, as shown in the box above, we naturally wish to move in this direction to achieve our goal of maximizing $f$. But when our desired direction $\nabla f$ is parallel to $\nabla g$, then we can't do any better. We're stuck: our desired direction is perpendicular to the path that satisfies $g=g_0$. So we're done—we've optimized our function subject to our constraint.

It's a common theme of these technical notes that we'd like to apply clever results to both move forward and gain engineering intuition. In this case, we could cleverly obtain the same equation and satisfy the original constraint of $g=g_0$ if we were to proceed as follows:

The Lagrange multiplier method

To maximize $f$ while satisfying $g=g_0$, let's re-express the latter relationship as $g-g_0=0$ and write

$$\mathcal{L}=f-\alpha(g-g_0)$$

(where ${\cal{L}}$ is termed the Lagrangian) and maximize this expression over all $x$ and $y$ (written as $\max_{x,y}{\cal{L}}$), given that we'll commit to applying the following two operations:

Apply the gradient operator to ${\cal{L}}$ and set the result to zero, which recovers the equation $\nabla f=\alpha\nabla g$ (as described above) that states that the gradients are parallel. Recall that the gradient is just a set of spatial derivatives, so this process is equivalent to setting the derivative with respect to $x$ to zero, then with respect to $y$.
Set the derivative of ${\cal{L}}$ with respect to $\alpha$ to zero, or $\frac{\partial {\cal{L}}}{\partial\alpha}=0$, which gives the constraint exactly.

You might note that we can accommodate any number of additional constraints $g_m=g_{m,0}$ by simply appending $\sum_m\alpha_m(g_m-g_{m,0})$ to the Langrangian. Wow, I've got chills.

Aren't these two steps equivalent to just individually setting the derivative with respect to every input dimension and every Lagrange multiplier to zero? Yes, exactly, and this is a great way to think about it. In fact, we can express this operation succinctly as

$$\nabla_{x,y,\alpha}{\cal{L}}=0.$$

Remember also that we further wished to obtain $\max_{x,y}{\cal{L}}$. But wait: doesn't that operation encompass solving $\nabla_{x,y}{\cal{L}}=0$ anyway because maximizing or minimizing a function involves finding where its derivatives are zero? Yes, except for the case where the maximum lies at infinity, so we would need to throw away those solutions as nonviable. So we find that we can write

$$\max_{x,y}{\cal{L}}~\mathrm{such~that}~\frac{\partial{\cal{L}}}{\partial\alpha}=0.$$

But then we might as well simply write

$$\max_{x,y,\alpha}{\cal{L}}$$

because maximizing any function over $\alpha$ would entail setting its derivative to zero as well (again, given that we throw away solutions at infinity).

To reiterate, we're happy to allow the Lagrange multiplier $\alpha$ to assume any value necessary to satisfy the equations obtained from the steps listed above—except infinity. A result of infinity would correspond to no solution.

(This is such an important point that it's useful to re-express it. We can't have an $x,y$ solution that entails $g(x,y)\ne g_0$ because that condition violates our fundamental constraint. If $g > g_0$, then our goal of maximizing the Lagrangian ${\cal{L}}$ could be met with $\alpha=\infty$, which we've decided is an invalid solution. Conversely, if $g < g_0$, then we could maximize the Lagrangian ${\cal{L}}$ with $\alpha=-\infty$, which we've also ruled unacceptable. In this way, the Lagrange multiplier acts as both an essential solution tool and a diagnostic of our optimization solution.)

Thus, the Lagrange multiplier method turns an optimization problem with constraints into several simultaneous equations obtained from steps (1) and (2) that we're free to solve. In turn, the transition from writing $\nabla_{x,y,\alpha}{\cal{L}}=0$ to writing $\max_{x,y,\alpha}{\cal{L}}$ allows us to accommodate inequality constraints, which is a very powerful capability. For example, if the constraint were instead $g\ge g_0$, then we could still use the same Lagrangian as long as we modified the maximization term to read

$$\max_{x,y,\alpha\ge 0}{\cal{L}}.$$

In other words, we previously wished the Lagrange multiplier $\alpha$ to swing to $+\infty$ or $-\infty$ as necessary to invalidate any ${\cal{L}}$-optimizing solution involving $g\ne g_0$ because that condition would constitute a constraint violation. Now, however, in the inequality case, we're taking away the capacity for $\alpha$ to become arbitrarily negative when we maximize ${\cal{L}}$ because it's fine if $g\ge g_0$.

Isn't this modification clever? Familiarity with this reasoning will enable forward movement.

Let's apply this new familiarity to our problem. By comparing the Lagrangian with the primal form stated above, we can link $f$ (the function to be maximized or minimized) to $\frac{1}{2}||\boldsymbol{w}||^2=\frac{1}{2}(\boldsymbol{w}\cdot\boldsymbol{w})$. And we can link the canonical constraint term $g-g_0$ to our constraint as $(\boldsymbol{w}\cdot\boldsymbol{x_m}+b)y_m-1$. The Lagrangian—customized for our optimization problem, which itself is customized for our classification problem—is therefore

$${\cal{L}}(\boldsymbol{w},b,\alpha)=\frac{1}{2}(\boldsymbol{w}\cdot\boldsymbol{w})-\sum_m\alpha_m\left[(\boldsymbol{w}\cdot\boldsymbol{x_m}+b)y_m-1\right].$$

Because our constraint is expressed as an inequality, we'll maximize ${\cal{L}}(\boldsymbol{w},b,\alpha)$ over all nonnegative $\alpha$, as discussed in the box above:

And since we specifically wish to minimize $\frac{1}{2}||\boldsymbol{w}||^2$ by adjusting $\boldsymbol{w}$ and $b$, we write as our final form

$$\min_{\boldsymbol{w},b}\max_{\alpha\ge 0}{\cal{L}}(\boldsymbol{w},b,\alpha).$$

We're now ready to differentiate this Lagrangian with respect to our various parameters (as specified in the box above) and see what we find. First, differentiating with respect to $\boldsymbol{w}$ and setting the result to zero gives

$$\frac{\partial{\cal{L}}}{\partial\boldsymbol{w}}=0\Rightarrow \boldsymbol{w}-\sum_m\alpha_m\boldsymbol{x}_m y_m=0,$$

(If differentiating the dot product of vectors is new, try expressing the dot product $\boldsymbol{w}\cdot\boldsymbol{w}$ as $\sum_i w_iw_i=\sum_i w_i^2$ and bringing the differential operator inside the summation sign. The denominator-layout nomenclature is used.)

Second, differentiating with respect to $b$ and setting the result to zero gives us

Finally, it's insightful to verify that if we differentiate the Lagrangian with respect to $\alpha_m$, then we obtain our original constraint (or at least the equality part of it—the inequality is addressed by our use of $\max_{\alpha\ge 0}$).

We've just done everything required to apply the Lagrange multiplier machinery; what remains is to work through the implications and revel in the outcome.

$${\cal{L}}(\boldsymbol{w},b,\alpha)=\frac{1}{2}(\boldsymbol{w}\cdot\boldsymbol{w})-\sum_{m}\alpha_m(\boldsymbol{w}\cdot\boldsymbol{x_m})y_m-b\sum_{m}\alpha_my_m+\sum_{m}\alpha_m.$$

The second differentiation outcome above tells us immediately that the third term on the right-hand side is zero, and the first informs us that the first term,

$$\frac{1}{2}(\boldsymbol{w}\cdot\boldsymbol{w})=\frac{1}{2}\sum_{m}\sum_{n}\alpha_m\alpha_ny_my_n(\boldsymbol{x_m}\cdot\boldsymbol{x_n}),$$

$$-\sum_{m}\alpha_m(\boldsymbol{w}\cdot\boldsymbol{x_m})y_m=-\sum_{m}\sum_{n}\alpha_m\alpha_ny_my_n(\boldsymbol{x_m}\cdot\boldsymbol{x_n}).$$

(If the transition from the dot product to indices seems opaque, it may help to review the definition of the dot product.) Thus, we have

$$\max_{\alpha\ge 0}{\cal{L}}(\alpha)=-\frac{1}{2}\sum_{m}\sum_{n}\alpha_m\alpha_ny_my_n(\boldsymbol{x_m}\cdot\boldsymbol{x_n})+\sum_{m}\alpha_m,$$

the minimization part having disappeared because the parameters $\boldsymbol{w}$ and $b$ no longer appear. This last expression is the dual form.

As I cautioned, this content has gotten very math intensive. There's a big payoff:

The kernel trick

Importantly, our latest version of the optimization problem, the dual form, includes only the dot products of the input data. Now, the outputs of dot products can be much simpler than their constitutive vectors! Consider, for example, how

$$(2\boldsymbol{i}+3\boldsymbol{j}+4\boldsymbol{k})\cdot(5\boldsymbol{i}+6\boldsymbol{j}+7\boldsymbol{k})$$

works out to simply $2\cdot 5+3\cdot 6+4\cdot 7=56$, which we can interpret as an objective measure of how much those two vectors point in the same direction, given their length.

In part 1, overlapping classes prompted us to transform our data from $x_m$ to $\boldsymbol{x_m}=\phi(x_m)$. That is, we decided to apply a transformation $\phi$ to present our data in a higher-dimensional space to achieve linear separability. We didn't know (and still don't know) how to select the best $\phi$ to use; we selected $\phi(x)=(x,(x-7)^2)$, but this transformation was straightforward and robust in the sense that we could have subtracted 6.8 or 7.2, for example.

Our latest result indicates that we now need only manipulate $\phi(x_m)\cdot\phi(x_n)$. The corresponding function for our particular transformation is

$$\phi(x_m)\cdot\phi(x_n)= x_m x_n+(x_m-7)^2(x_n-7)^2,$$

which simply outputs a number. We can represent this scalar function as $k(x_m,x_n)$, which is called the kernel. (The example shown above is called a decomposable kernel because it's clear what transformation was performed: add a second dimension in which we substract 7 from the data point, and then square the result.)

What if I told you that there are a variety of alternative kernel functions that we could have used that can successfully classify our groups—even our overlapping groups—seemingly on autopilot, without even needing to specify a particular transformation? I'm talking about functions such as

$$k(x_m,x_n)=(x_mx_n+1)^2$$

$$k(x_m,x_n)=\exp\left[-(x_m-x_n)^2\right]$$

$$k(x_m,x_n)=1-\frac{1}{1-|x_m-x_n|}$$

$$k(x_m,x_n)=\min(x_m,x_n),$$

for example, among others. This would be a big win. Happily*, this shortcut is possible and is known as the kernel trick. More on this topic as we work through the process of solving the dual problem (which we expect to produce the same results as solving the primal problem).

*English-language pedants: We are happy; the shortcut is not happy.

The potential advantages of the kernel trick—which are only hinted at above—arguably justify all the challenging mathematical manipulation that's been required up to now. Naturally, we wish to avoid having to develop a customized transformation $\phi$ for every new data set. We're very close to operating a true support vector machine, now understanding the details of what's actually happening.

For now, let's continue using the arbitrary $\phi(x_m)=(x_m,(x_m-7)^2)$ transformation that we designed earlier and show that we can solve the optimization problem in the dual form, obtaining the same result as the primal form. We start with the dual-form expression derived above, now incorporating the kernel:

$$\max_{\alpha\ge 0}{\cal{L}}(\alpha)\mathrm{,~where~}{\cal{L}}(\alpha)=-\frac{1}{2}\sum_{m}\sum_{n}\alpha_m\alpha_ny_my_nk(x_m,x_n)+\sum_{m}\alpha_m,$$

$$\mathrm{with~}\sum_m\alpha_my_m=0\mathrm{~(from~the~Lagrange~multiplier~process)},$$

$$\mathrm{maximize~}\left(-\frac{1}{2}\boldsymbol{\alpha}^\mathsf{T}\boldsymbol{y}\boldsymbol{y}^\mathsf{T}\boldsymbol{K}\boldsymbol{\alpha}+\boldsymbol{1}^\mathsf{T}\boldsymbol{\alpha}\right)\\\mathrm{(or,~equivalently,~minimize~}\frac{1}{2}\boldsymbol{\alpha}^\mathsf{T}\boldsymbol{y}\boldsymbol{y}^\mathsf{T}\boldsymbol{K}\boldsymbol{\alpha}-\boldsymbol{1}^\mathsf{T}\boldsymbol{\alpha}\mathrm{)},$$

where the kernel matrix $\boldsymbol{K}$ (also called the Gram matrix) is defined as $K_{mn}=k(x_m,x_n)$ and where $\boldsymbol{1}$ is a column vector of ones. It might be helpful—it certainly helps me—to write out these vectors and matrices for a low-dimensional case to see how the indicial notation and the vector notation match up.

Let’s return to the qpsolvers package in Python, applied in part 1, which, we recall, aims to satisfy the following conditions:

$$\mathrm{minimize~}\frac{1}{2}\boldsymbol{u}^\mathsf{T}\boldsymbol{Pu}+\boldsymbol{q}^\mathsf{T}\boldsymbol{u},$$

Comparing the two formulations, we find that we can associate $\boldsymbol{u}$ with $\boldsymbol{\alpha}$, $\boldsymbol{P}$ with $\boldsymbol{y}\boldsymbol{y}^\mathsf{T}\boldsymbol{K}$, $\boldsymbol{q}$ with $-\boldsymbol{1}$, $\boldsymbol{G}$ with $-\boldsymbol{I}$ (where $\boldsymbol{I}$ is the identity matrix, with the negative sign coming from the flipped inequality), $\boldsymbol{h}$ and $\boldsymbol{b}$ with $\boldsymbol{0}$, and $\boldsymbol{A}$ with $\boldsymbol{y}^\mathsf{T}$. This we can do.

The code is here. We obtain a solution of $\boldsymbol{\alpha}$ values (i.e., Lagrange multipliers) of $[0, 0, 0, 0, 0.01719, 0.01719]$.

We know from our previous solution of the primal problem that the input data points $x_5$ and $x_6$ are the support vectors in this case. So is $\alpha$ nonzero only for the support vectors? Absolutely. In fact, this is a classic characteristic of Lagrange multipliers: they're nonzero only when they come into play, when the constraint is being tested. Sound familiar? This rule of thumb is essentially how we think of the support vectors: they're the ones that really matter in the process of classification. To reiterate: support vectors are associated with nonzero Lagrange multipliers.

We can recover the corresponding so-called weights, or $\boldsymbol{w}$ vector, from the relationship derived earlier:

The output is $w_1=-0.038$ and $w_2=-0.182$, which exactly matches the values we calculated in part 1 by solving the primal form.

And of course we wish to calculate the so-called bias $b$. Recall that we chose to satisfy $\boldsymbol{w}\cdot\boldsymbol{x_m}+b=y_m$ exactly at the support vectors, as discussed in the canonical hyperplane section of part 1. And indeed, if we specifically use the support vector $\boldsymbol{x_5}$ (meaning $\left(x_5,(x_5-7)^2\right)$), then we obtain $b=1.62062$, as verified in the code linked above. If we use $\boldsymbol{x_6}$, then we obtain $b=1.62059$. So why not average over all the support vectors (well, it's just these two in this case) to accommodate the slight variations that arise in numerical solutions. This average is $b=1.62060$, which matches the value of $b$ calculated in part 1.

But let’s get used to performing classification using the dual form (which emphasizes the Lagrange multipliers $\boldsymbol{\alpha}$ rather than the weights and bias). Whereas before we performed classification by checking the sign of

let's apply the relation $\boldsymbol{w}=\sum_m\alpha_m\boldsymbol{x_m}y_m$, derived in the Lagrange multipliers section above, to classify new points by the sign of

where $\boldsymbol{K^\prime}$ is an $m$×$m$ matrix with diagonal values of $\boldsymbol{x_m}\cdot\boldsymbol{x}=\phi(x_m)\cdot\phi(x)=k(x_m,x)$; all other entries are zero.

As before, we should expect the result to be exactly –1 for the support vectors of the –1 class and +1 for the support vectors of the +1 class. This output can also be verified through the code linked above. We obtain the same values of $\boldsymbol{w}$ and $b$.

So if the output of solving the primal problem is exactly the same as that of the dual problem, why did we bother working through this transformation? The reason is that now that we're aware of the kernel trick, it's not necessary to work with the customized and decomposable kernel $k(x_m,x_n)=x_m x_n+(x_m-7)^2(x_n-7)^2$ (which we chose by inspection for a relatively simple input data set), which resulted in the following divisions and support vectors:

In fact, we might also obtain satisfactory results with the decomposable kernel $k(x_m,x_n)=\phi(x_m)\cdot\phi(x_n)=x_m x_n+(x_m-x_0)^2(x_n-x_0)^2$ but with an offset of $x_0=0$:

In other words, we didn’t need to be careful in choosing an offset of 7 at all. Wow.

And we might also obtain satisfactory results with a kernel of $k(x_m,x_n)=(x_m\cdot x_n+c)^d$ (the so-called polynomial kernel), where we're free to choose the values of $c$ and $d$. For our data, I obtain correct classification with $c=1$ and $d=2$, for example:

Actually, anything in the ranges of $c=1-100$ and $d=2-4$ allows correct classification. It's remarkable that such a broad range of kernel functions results in correct classification for this dataset that's not linearly separable in 1D.

What's happening here is that a kernel of $(x_mx_n+1)^2$, for example, implies that the data have been transformed into a three-dimensional space: $\boldsymbol{x_m}=\phi(x_m)=(1,x_m^2,\sqrt{2}x_m)$. (It's good practice to verify that the dot product $\boldsymbol{x_m\cdot x_n}$ in this case indeed gives the kernel function $k(x_m,x_n)=(x_mx_n+1)^2$ exactly.) And maybe it shouldn't be surprising, given our successful customized transformation into a second dimension, that machine transformation into three dimensions also provides opportunities for correct classification. The point is that we didn't have to design this transformation or visualize the exact results in three dimensions; we only had to test the input parameters of a certain kernel to secure a valid output.

Furthermore, we might obtain satisfactory results with a kernel of $\exp\left(-\frac{(x_m-x_n)^2}{2\sigma^2}\right)$ (sometimes called the Gaussian kernel), where we're free to choose the value of $\sigma$. For our data, I obtain correct classification simply with $\sigma=1$, for example:

It turns out that anything in the range of $\sigma=0.5-20$ allows correct classification.

Note how this particular kernel renders almost all of our input data as support vectors. This is a characteristic of so-called radial basis kernels, or kernels that incorporate the distance between points as $||x_m-x_n||$—they tend to incorporate much of the complete data set when producing a result.

There's an interesting feature of this Gaussian kernel (and other kernels that contain the exponential function). Recall that the preceding kernel, $k(x_m,x_n)=(x_m\cdot x_n+c)^d$, could be algebraically shown to represent a transformation into three dimensions (that we didn't have to worry about or visualize). A startling aspect of the Gaussian kernel is that it accesses an infinite-dimensional (!) space—that is (letting $\sigma=\sqrt{1/2}$ for simplicity), this kernel can be decomposed into

which is amenable to a Taylor series expansion of the last exponential term to give

$$\left(\sum_k^\infty\frac{2^{k/2}x_m^k\exp(-x_m^2)}{\sqrt{k!}}\right)\cdot\left(\sum_k^\infty\frac{2^{k/2}x_n^k\exp(-x_n^2)}{\sqrt{k!}}\right),$$

which is a symmetric dot product of identical functions of $x_m$ and $x_n$, revealing that we can, in fact, view the Gaussian kernel as representing an infinite-degree transformation. Although we could never conceivably design every aspect of such a transformation, this kernel effectively applies the operation for us.

And finally, we might obtain satisfactory results with a kernel of $\min(x_m,x_n)$:

Wait, hold up. This kernel is almost trivial. It doesn't even really do anything; it only selects the lesser of two values. But it still manages to classify our groups correctly? Yes. This output can also be verified using the code code linked above.

This outcome of multiple generic kernels providing correct classification is (in my opinion) as amazing as the finding (presented in part 1) that variable transformations can be used to convert overlapping data into linearly separable data. We're able to obtain accurate classification from a variety of standardized frameworks; although we have to experiment with the input parameters to a certain extent, suitable classification appears to be pretty straightforward, as least for this problem.

But where do these expressions come from? The situation reminds me of that of Bayesian priors, where the joke is that nobody knows and we shouldn’t ask. In this case, we would generally like the Gram matrix $\boldsymbol{K}$ of the candidate kernel function to be symmetric and to be positive semidefinite (which is sort of the matrix equivalent of $\ge 0$, although it doesn't imply that any particular element is $\ge 0$), corresponding to the so-called Mercer conditions, the exploration of which is outside the scope of this note. In addition, different kernels are favored for different applications; a variety of kernels and their applications are presented here.

Soft margin classification via slack variables

There's one more step before applying an actual support vector machine package to our example data. After all, we always wish to understand what variables we're assigning, and there's a certain parameter (usually designated $C$) that plays an enormous role in support vector machine operation and should be discussed.

Maloney Home Page | Support vector regression applied to Raman spectra (part 2)

A brief review

Converting the primal form to the dual form

Soft margin classification via slack variables

Operating a support vector machine—finally

A gallery of classification results

What still needs to be covered