2016-02-09

## Nonlinearly separable data

• A linear boundary might be too simple to capture the class structure.

• One way to obtain a nonlinear decision boundary in the input space is to find a linear decision boundary in an expanded space (as in polynomial regression).

• Thus, $${{{\mathbf{x}}_i}}$$ is replaced by $$\phi({{{\mathbf{x}}_i}})$$, where $$\phi$$ is called a feature mapping.

more flexible decision boundary $$\approx$$ enriched feature space
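As a minimal illustration (with a made-up 1-D dataset), the mapping $$x \mapsto (x, x^2)$$ turns classes that no threshold on the line can separate into classes separable by a line in the expanded space:

```python
import numpy as np

# 1-D points: the positive class sits between the negatives,
# so no single threshold on x separates the classes.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = np.array([-1, 1, 1, 1, -1])

# Feature mapping phi(x) = (x, x^2): the expanded space is 2-D.
phi = np.column_stack([x, x**2])

# In the expanded space the linear rule sign(1.5 - x^2) separates
# the classes, i.e. w = (0, -1), w0 = 1.5 (weights chosen by hand here).
w, w0 = np.array([0.0, -1.0]), 1.5
pred = np.sign(phi @ w + w0)
print(pred)  # matches y
```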

## Margin optimization in feature space

• Replacing $${{{\mathbf{x}}_i}}$$ with $$\phi({{{\mathbf{x}}_i}})$$, the dual form becomes:
 max $$\sum_{i=1}^n\alpha_i-\frac{1}{2}\sum_{i,j=1}^n{{y}}_i{{y}}_j\alpha_i\alpha_j({\phi({{{\mathbf{x}}_i}})\cdot\phi({{{\mathbf{x}}_j}})})$$ w.r.t. $$\alpha_i$$ s.t. $$0\leq\alpha_i\leq C$$ and $$\sum_{i=1}^n\alpha_i{{y}}_i=0$$
• Classification of an input $${\mathbf{x}}$$ is given by: $h_{{{\mathbf{w}}},w_0}({{{\mathbf{x}}}}) = \mbox{sign}\left(\sum_{i=1}^n\alpha_i{{y}}_i({\phi({{{\mathbf{x}}_i}})\cdot\phi({{{\mathbf{x}}}})})+w_0\right)$

• Note that in the dual form, to do both SVM training and prediction, we only ever need to compute dot-products of feature vectors.
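Concretely, the dual objective depends on the training data only through the $$n\times n$$ matrix of pairwise dot products (the Gram matrix). A sketch with random data and the identity feature mapping:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 3))  # n = 4 examples, p = 3 features

# With a nontrivial phi, replace each row x_i by phi(x_i) first.
G = X @ X.T  # G[i, j] = x_i . x_j -- all the dual form ever needs

print(G.shape)  # (4, 4)
```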

## Kernel functions

• Whenever a learning algorithm (such as SVMs) can be written in terms of dot-products, it can be generalized to kernels.

• A kernel is any function $$K:\mathbb{R}^p\times\mathbb{R}^p\mapsto\mathbb{R}$$ which corresponds to a dot product for some feature mapping $$\phi$$: $K({{{\mathbf{x}}}}_1,{{{\mathbf{x}}}}_2)=\phi({{{\mathbf{x}}}}_1)\cdot\phi({{{\mathbf{x}}}}_2) \text{ for some }\phi.$

• Conversely, by choosing feature mapping $$\phi$$, we implicitly choose a kernel function

• Recall that $$\phi({{{\mathbf{x}}}}_1)\cdot \phi({{{\mathbf{x}}}}_2) = \|\phi({{{\mathbf{x}}}}_1)\|\,\|\phi({{{\mathbf{x}}}}_2)\| \cos \angle(\phi({{{\mathbf{x}}}}_1),\phi({{{\mathbf{x}}}}_2))$$, where $$\angle$$ denotes the angle between the feature vectors, so a kernel function can be thought of as a notion of similarity between inputs.
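A quick numerical check of this similarity view (a sketch using the quadratic kernel defined just below): normalizing the kernel value by the feature-vector norms recovers exactly the cosine of the angle between $$\phi({\mathbf{x}})$$ and $$\phi({\mathbf{z}})$$.

```python
import numpy as np

def K(x, z):
    # quadratic kernel (x . z)^2
    return np.dot(x, z) ** 2

def phi(x):
    # explicit quadratic feature map: all products x_i * x_j
    return np.outer(x, x).ravel()

x = np.array([1.0, 2.0])
z = np.array([3.0, 1.0])

cos_feat = phi(x) @ phi(z) / (np.linalg.norm(phi(x)) * np.linalg.norm(phi(z)))
cos_kern = K(x, z) / np.sqrt(K(x, x) * K(z, z))
print(np.isclose(cos_feat, cos_kern))  # True
```

Note that $$K({\mathbf{x}},{\mathbf{x}}) = \|\phi({\mathbf{x}})\|^2$$, which is why the norms can be computed from kernel evaluations alone.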

• Let $$K({\mathbf{x}},{\bf z} )= \left({\mathbf{x}}\cdot {\bf z}\right)^2$$.

• Is this a kernel? $K({\mathbf{x}},{\bf z}) = \left( \sum_{i=1}^p x_i z_i\right) \left( \sum_{j=1}^p x_j z_j \right) = \sum_{i,j\in\{1\ldots p\}} \left( x_i x_j \right) \left( z_i z_j \right)$

• Hence, it is a kernel, with feature mapping: $\phi({\mathbf{x}}) = \langle x_1^2, ~x_1x_2, ~\ldots, ~x_1x_p, ~x_2x_1, ~x_2^2, ~\ldots, ~x_p^2 \rangle$ Feature vector includes all squares of elements and all cross terms.

• Note that computing $$\phi$$ takes $$O(p^2)$$ but computing $$K$$ takes only $$O(p)$$!
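The identity above is easy to verify numerically (a sketch with random vectors):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 5
x, z = rng.normal(size=p), rng.normal(size=p)

# O(p): evaluate the kernel directly.
K_xz = np.dot(x, z) ** 2

# O(p^2): build the explicit feature vectors (all products x_i x_j),
# then take their dot product.
phi_x = np.outer(x, x).ravel()   # length p^2
phi_z = np.outer(z, z).ravel()

print(np.isclose(K_xz, phi_x @ phi_z))  # True
```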

## Polynomial kernels

• More generally, $$K({\mathbf{x}},{{\mathbf{z}}}) = ({\mathbf{x}}\cdot{{\mathbf{z}}})^d$$ is a kernel, for any positive integer $$d$$: $K({\mathbf{x}},{{\mathbf{z}}}) = \left(\sum_{i=1}^p x_i z_i\right)^d$

• If we expand the sum above in the obvious way, we get $$p^d$$ terms (i.e., an explicit feature expansion)

• Terms are monomials (products of $$x_i$$) with degree equal to $$d$$.

• If we use the primal form of the SVM, each of these will have a weight associated with it!

• Curse of dimensionality: it is very expensive both to optimize and to predict with an SVM in primal form

• However, evaluating the dot-product of any two feature vectors can be done using $$K$$ in $$O(p)$$!

## The "kernel trick"

• If we work with the dual, we do not actually have to ever compute the feature mapping $$\phi$$. We just have to compute the similarity $$K$$.

• That is, we can solve the dual for the $$\alpha_i$$:

 max $$\sum_{i=1}^n\alpha_i-\frac{1}{2}\sum_{i,j=1}^n{{y}}_i{{y}}_j\alpha_i\alpha_j K({\mathbf{x}}_i,{\mathbf{x}}_j)$$ w.r.t. $$\alpha_i$$ s.t. $$0\leq\alpha_i\leq C$$, $$\sum_{i=1}^n\alpha_i{{y}}_i=0$$
• The class of a new input $${\mathbf{x}}$$ is computed as: $h_{{{\mathbf{w}}},w_0}({\mathbf{x}}) = \mbox{sign} \left( \sum_{i=1}^n \alpha_i y_i K({\mathbf{x}}_i,{\mathbf{x}}) + w_0 \right)$

• Often, $$K(\cdot,\cdot)$$ can be evaluated in $$O(p)$$ time, a big savings!
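The prediction rule above can be sketched directly in code. The support vectors, multipliers $$\alpha_i$$, and offset $$w_0$$ below are hypothetical placeholder values, standing in for what a dual solver would return:

```python
import numpy as np

def rbf(x, z, sigma=1.0):
    # Gaussian kernel; any valid kernel K could be used here.
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def predict(x, support_x, alpha, y, w0, kernel):
    # h(x) = sign( sum_i alpha_i * y_i * K(x_i, x) + w0 )
    s = sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha, y, support_x))
    return np.sign(s + w0)

# Hypothetical support vectors and multipliers (not a trained model).
support_x = [np.array([0.0, 1.0]), np.array([1.0, 0.0])]
alpha = [0.7, 0.7]
y = [1, -1]
w0 = 0.0

print(predict(np.array([0.0, 0.9]), support_x, alpha, y, w0, rbf))  # 1.0
```

Only the support vectors (examples with $$\alpha_i > 0$$) contribute to the sum, so prediction cost scales with the number of support vectors, not the training-set size.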

## Some other (fairly generic) kernel functions

• $$K({\mathbf{x}},{{\mathbf{z}}})=(1+{\mathbf{x}}\cdot{{\mathbf{z}}})^d$$: the feature expansion has all monomial terms of degree $$\leq d$$.

• Radial basis/Gaussian kernel: $K({\mathbf{x}},{{\mathbf{z}}}) = \exp(-\|{\mathbf{x}}-{{\mathbf{z}}}\|^2/2\sigma^2)$ The kernel has an infinite-dimensional feature expansion, but dot-products can still be computed in $$O(p)$$!

• Sigmoidal kernel: $K({\mathbf{x}},{{\mathbf{z}}}) = \tanh (c_1 {\mathbf{x}}\cdot{{\mathbf{z}}}+ c_2)$
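The kernels above are all one-liners in numpy (a sketch; the hyperparameters $$d$$, $$\sigma$$, $$c_1$$, $$c_2$$ are the ones from the formulas):

```python
import numpy as np

def poly_kernel(x, z, d=2):
    # (1 + x . z)^d : all monomials of degree <= d
    return (1.0 + np.dot(x, z)) ** d

def rbf_kernel(x, z, sigma=1.0):
    # exp(-||x - z||^2 / (2 sigma^2)); note K(x, x) = 1 for any x
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(x, z, c1=1.0, c2=0.0):
    # tanh(c1 * (x . z) + c2); not positive semidefinite for all c1, c2
    return np.tanh(c1 * np.dot(x, z) + c2)

x = np.array([1.0, 2.0])
print(rbf_kernel(x, x))  # 1.0
```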