2018-10-18

## Nonlinearly separable data

• A linear boundary might be too simple to capture the class structure.

• One way of getting a nonlinear decision boundary in the input space is to find a linear decision boundary in an expanded space (similar to polynomial regression).

• Thus, $${{\mathbf{x}_i}}$$ is replaced by $$\phi({{\mathbf{x}_i}})$$, where $$\phi$$ is called a feature mapping.
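
As a small illustration of the idea, the sketch below uses a made-up 1-D dataset that no single threshold on $$x$$ can separate, and a hypothetical feature mapping $$\phi(x) = (x, x^2)$$. The weights `w` and `w0` are hand-picked for the example, not learned.

```python
import numpy as np

def phi(x):
    """Hypothetical feature mapping for illustration: x -> (x, x^2)."""
    x = np.asarray(x, dtype=float)
    return np.stack([x, x ** 2], axis=-1)

# Toy 1-D data: the outer points are positive, the inner points negative,
# so no single threshold on x separates the classes.
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([1, -1, -1, 1])

# A linear boundary in the expanded space (hand-picked, not learned):
# w . phi(x) + w0 = x^2 - 2.5
w = np.array([0.0, 1.0])
w0 = -2.5

pred = np.sign(phi(x) @ w + w0).astype(int)
print(pred, (pred == y).all())
```

The boundary is linear in $$(x, x^2)$$ but corresponds to the nonlinear rule $$x^2 > 2.5$$ in the original input space.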

## Separability by adding features

more flexible decision boundary $$\approx$$ enriched feature space

## Margin optimization in feature space

• Replacing $${{\mathbf{x}_i}}$$ with $$\phi({{\mathbf{x}_i}})$$, the dual form becomes:
 max $$\sum_{i=1}^n\alpha_i-\frac{1}{2}\sum_{i,j=1}^n{{y}}_i{{y}}_j\alpha_i\alpha_j({\phi({{\mathbf{x}_i}})\cdot\phi({{\mathbf{x}_j}})})$$ w.r.t. $$\alpha_i$$ s.t. $$0\leq\alpha_i\leq C$$ and $$\sum_{i=1}^n\alpha_i{{y}}_i=0$$
• Classification of an input $$\mathbf{x}$$ is given by: $h_{{{\mathbf{w}}},w_0}({{\mathbf{x}}}) = \mbox{sign}\left(\sum_{i=1}^n\alpha_i{{y}}_i({\phi({{\mathbf{x}_i}})\cdot\phi({{\mathbf{x}}})})+w_0\right)$

• Note that in the dual form, to do both SVM training and prediction, we only ever need to compute dot-products of feature vectors.
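
The following sketch checks this numerically for prediction: the dual score, which touches the feature vectors only through dot-products, matches the primal score computed with $$\mathbf{w} = \sum_i \alpha_i y_i \phi(\mathbf{x}_i)$$. The feature vectors and the values of `alpha` and `w0` are random placeholders, not the solution of an actual dual problem.

```python
import numpy as np

def dual_decision(alpha, y, Phi_train, phi_x, w0):
    """Score sum_i alpha_i y_i (phi(x_i) . phi(x)) + w0, using dot-products only."""
    dots = Phi_train @ phi_x          # phi(x_i) . phi(x) for every training point
    return np.sum(alpha * y * dots) + w0

rng = np.random.default_rng(0)
Phi_train = rng.normal(size=(5, 3))   # rows play the role of phi(x_i)
y = np.array([1, -1, 1, -1, 1])
alpha = rng.uniform(0.0, 1.0, size=5) # placeholder multipliers (assumption)
w0 = 0.3                              # placeholder offset (assumption)
phi_x = rng.normal(size=3)            # feature vector of a new input

# Primal weights implied by the multipliers: w = sum_i alpha_i y_i phi(x_i)
w = (alpha * y) @ Phi_train
primal_score = w @ phi_x + w0
dual_score = dual_decision(alpha, y, Phi_train, phi_x, w0)

print(np.isclose(primal_score, dual_score))
```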

## Kernel functions

• Whenever a learning algorithm (such as SVMs) can be written in terms of dot-products, it can be generalized to kernels.

• A kernel is any function $$K:\mathbb{R}^p\times\mathbb{R}^p\to\mathbb{R}$$ (where $$p$$ is the input dimension) which corresponds to a dot product for some feature mapping $$\phi$$: $K({{\mathbf{x}}}_1,{{\mathbf{x}}}_2)=\phi({{\mathbf{x}}}_1)\cdot\phi({{\mathbf{x}}}_2).$

• Conversely, by choosing a feature mapping $$\phi$$, we implicitly choose a kernel function.

• Recall that $$\phi({{\mathbf{x}}}_1)\cdot \phi({{\mathbf{x}}}_2) \propto \cos \angle(\phi({{\mathbf{x}}}_1),\phi({{\mathbf{x}}}_2))$$ where $$\angle$$ denotes the angle between the vectors, so a kernel function can be thought of as a notion of similarity.

## Example: Quadratic kernel

• Let $$K(\mathbf{x},{\bf z} )= \left(\mathbf{x}\cdot {\bf z}\right)^2$$.

• Is this a kernel? $K(\mathbf{x},{\bf z}) = \left( \sum_{i=1}^p x_i z_i\right) \left( \sum_{j=1}^p x_j z_j \right) = \sum_{i,j\in\{1\ldots p\}} \left( x_i x_j \right) \left( z_i z_j \right)$

• Hence, it is a kernel, with feature mapping: $\phi(\mathbf{x}) = \langle x_1^2, ~x_1x_2, ~\ldots, ~x_1x_p, ~x_2x_1, ~x_2^2, ~\ldots, ~x_p^2 \rangle$ The feature vector includes all squares of elements and all cross terms.

• Note that computing $$\phi$$ takes $$O(p^2)$$ but computing $$K$$ takes only $$O(p)$$!
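
A quick numerical check of this identity, as a sketch: `phi_quadratic` builds the explicit $$p^2$$-dimensional feature vector of all products $$x_i x_j$$, while `k_quadratic` evaluates the kernel directly. Both function names and the example vectors are made up for illustration.

```python
import numpy as np

def phi_quadratic(x):
    """Explicit feature vector of all p^2 products x_i * x_j (O(p^2) work)."""
    x = np.asarray(x, dtype=float)
    return np.outer(x, x).ravel()

def k_quadratic(x, z):
    """Quadratic kernel (x . z)^2, computed in O(p) without building phi."""
    return float(np.dot(x, z)) ** 2

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])

explicit = phi_quadratic(x) @ phi_quadratic(z)  # dot product in feature space
implicit = k_quadratic(x, z)                    # kernel in input space

print(explicit, implicit)
```

Here $$\mathbf{x}\cdot\mathbf{z} = 4.5$$, so both routes give $$20.25$$.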

## Polynomial kernels

• More generally, $$K(\mathbf{x},{{\mathbf{z}}}) = (1 + \mathbf{x}\cdot{{\mathbf{z}}})^d$$ is a kernel, for any positive integer $$d$$.

• If we expand the product above, we get terms for all degrees up to and including $$d$$ (in $$x_i$$ and $$z_i$$).

• If we use the primal form of the SVM, each of these terms will have a weight associated with it!

• Curse of dimensionality: it is very expensive both to optimize and to predict with an SVM in primal form with many features.
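
To get a feel for how quickly the primal weight vector grows, the sketch below counts the distinct monomials of degree at most $$d$$ in $$p$$ variables, which is $$\binom{p+d}{d}$$, and cross-checks the formula by brute-force enumeration for a small case. The helper names are hypothetical.

```python
from itertools import combinations_with_replacement
from math import comb

def num_poly_features(p, d):
    """Number of distinct monomials of degree <= d in p variables."""
    return comb(p + d, d)

def enumerate_monomials(p, d):
    """Brute-force list of monomials, each a tuple of variable indices."""
    return [m
            for k in range(d + 1)
            for m in combinations_with_replacement(range(p), k)]

# Cross-check the closed form against enumeration for a small case.
assert len(enumerate_monomials(3, 2)) == num_poly_features(3, 2)

# Growth of the feature count with input dimension p and degree d.
for p in (10, 100, 1000):
    for d in (2, 3, 5):
        print(f"p={p:5d} d={d}  features={num_poly_features(p, d)}")
```

Evaluating $$(1 + \mathbf{x}\cdot\mathbf{z})^d$$ directly costs $$O(p)$$ regardless of how large this count is.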

## The “kernel trick”

• If we work with the dual, we never actually have to compute the features using $$\phi$$. We just have to compute the similarity $$K$$.

• That is, we can solve the dual for the $$\alpha_i$$:

 max $$\sum_{i=1}^n\alpha_i-\frac{1}{2}\sum_{i,j=1}^n{{y}}_i{{y}}_j\alpha_i\alpha_j K(\mathbf{x}_i,\mathbf{x}_j)$$ w.r.t. $$\alpha_i$$ s.t. $$0\leq\alpha_i\leq C$$, $$\sum_{i=1}^n\alpha_i{{y}}_i=0$$
• The class of a new input $$\mathbf{x}$$ is computed as: $h_{{{\mathbf{w}}},w_0}(\mathbf{x}) = \mbox{sign} \left( \sum_{i=1}^n \alpha_i y_i K(\mathbf{x}_i,\mathbf{x}) + w_0 \right)$

• Often, $$K(\cdot,\cdot)$$ can be evaluated in $$O(p)$$ time—a big savings!
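
A minimal sketch of the two formulas above on a toy XOR-style dataset, using the quadratic kernel $$(\mathbf{x}\cdot\mathbf{z})^2$$. The multipliers `alpha` and the offset `w0` are hand-picked rather than produced by a solver; the code only checks that they satisfy $$\sum_i \alpha_i y_i = 0$$ and classify the training points correctly. All function names are illustrative.

```python
import numpy as np

def quadratic_kernel(x, z):
    """K(x, z) = (x . z)^2."""
    return np.dot(x, z) ** 2

def gram_matrix(X, kernel):
    """n x n matrix of kernel values K(x_i, x_j)."""
    n = X.shape[0]
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

def dual_objective(alpha, y, K):
    """sum_i alpha_i - 1/2 sum_ij y_i y_j alpha_i alpha_j K_ij."""
    ay = alpha * y
    return alpha.sum() - 0.5 * ay @ K @ ay

def predict(x, X, y, alpha, w0, kernel):
    """sign(sum_i alpha_i y_i K(x_i, x) + w0); phi is never computed."""
    k = np.array([kernel(xi, x) for xi in X])
    return int(np.sign(np.sum(alpha * y * k) + w0))

# XOR-style data: not linearly separable in the input space.
X = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
y = np.array([1, 1, -1, -1])

alpha = np.full(4, 0.125)  # hand-picked multipliers (assumption, not solver output)
w0 = 0.0                   # hand-picked offset (assumption)

K = gram_matrix(X, quadratic_kernel)
preds = [predict(x, X, y, alpha, w0, quadratic_kernel) for x in X]
print(preds, dual_objective(alpha, y, K))
```

In practice the $$\alpha_i$$ would come from a quadratic programming solver applied to the dual; only the Gram matrix `K` needs to be passed to it.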

## Some other (fairly generic) kernel functions

• $$K(\mathbf{x},{{\mathbf{z}}})=(1+\mathbf{x}\cdot{{\mathbf{z}}})^d$$ – feature expansion has all monomial terms of degree $$\leq d$$.

• Radial Basis Function (RBF)/Gaussian kernel (most popular): $K(\mathbf{x},{{\mathbf{z}}}) = \exp(-\gamma \|\mathbf{x}-{{\mathbf{z}}}\|^2)$ The kernel has an infinite-dimensional feature expansion, but dot-products can still be computed in $$O(p)$$ time!

• Sigmoid kernel: $K(\mathbf{x},{{\mathbf{z}}}) = \tanh (c_1 \mathbf{x}\cdot{{\mathbf{z}}}+ c_2)$
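
The three kernels listed above can be written in a few lines each. This is a sketch: the function names and the default values for `d`, `gamma`, `c1` and `c2` are arbitrary choices for illustration, not recommended settings.

```python
import numpy as np

def polynomial_kernel(x, z, d=2):
    """K(x, z) = (1 + x . z)^d."""
    return (1.0 + np.dot(x, z)) ** d

def rbf_kernel(x, z, gamma=0.5):
    """K(x, z) = exp(-gamma * ||x - z||^2)."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return np.exp(-gamma * np.dot(diff, diff))

def sigmoid_kernel(x, z, c1=1.0, c2=0.0):
    """K(x, z) = tanh(c1 * x . z + c2)."""
    return np.tanh(c1 * np.dot(x, z) + c2)

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

for name, k in [("polynomial", polynomial_kernel),
                ("rbf", rbf_kernel),
                ("sigmoid", sigmoid_kernel)]:
    print(name, k(x, z))
```

Each evaluation costs $$O(p)$$, since it needs only one dot product or one squared distance between the two inputs.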