Nonlinearly separable data

  • A linear boundary might be too simple to capture the class structure.

  • One way of getting a nonlinear decision boundary in the input space is to find a linear decision boundary in an expanded space (similar to polynomial regression).

  • Thus, \({{{\mathbf{x}}_i}}\) is replaced by \(\phi({{{\mathbf{x}}_i}})\), where \(\phi\) is called a feature mapping.
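As a minimal sketch of the idea above, consider four XOR-style points that no line in the input space can separate. The mapping `phi` below, which appends the cross term \(x_1x_2\), and the weight vector `w` are hypothetical choices made for illustration only.

```python
import numpy as np

# Four XOR-style points: the label is the sign of x1 * x2,
# so no linear boundary in the 2-D input space separates them.
X = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
y = np.array([1, 1, -1, -1])

def phi(X):
    """Hypothetical feature mapping: append the cross term x1 * x2."""
    return np.column_stack([X, X[:, 0] * X[:, 1]])

# In the expanded 3-D space, the hyperplane w . phi(x) = 0 with
# w = (0, 0, 1) separates the two classes.
w = np.array([0.0, 0.0, 1.0])
predictions = np.sign(phi(X) @ w).astype(int)
```

Here `predictions` agrees with `y` on all four points, even though the boundary \(x_1x_2=0\) is nonlinear in the original inputs.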

Separability by adding features

more flexible decision boundary \(\approx\) enriched feature space

Margin optimization in feature space

  • Replacing \({{{\mathbf{x}}_i}}\) with \(\phi({{{\mathbf{x}}_i}})\), the dual form becomes:
    max \(\sum_{i=1}^n\alpha_i-\frac{1}{2}\sum_{i,j=1}^n y_i y_j\alpha_i\alpha_j\left(\phi(\mathbf{x}_i)\cdot\phi(\mathbf{x}_j)\right)\) w.r.t. \(\alpha_i\)
    s.t. \(0\leq\alpha_i\leq C\), \(\sum_{i=1}^n\alpha_i y_i=0\)
  • Classification of an input \({\mathbf{x}}\) is given by: \[h_{{{\mathbf{w}}},w_0}({{{\mathbf{x}}}}) = \mbox{sign}\left(\sum_{i=1}^n\alpha_i{{y}}_i({\phi({{{\mathbf{x}}_i}})\cdot\phi({{{\mathbf{x}}}})})+w_0\right)\]

  • Note that in the dual form, to do both SVM training and prediction, we only ever need to compute dot-products of feature vectors.
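To make this concrete, here is a small sketch that evaluates the dual objective using nothing but the matrix of pairwise dot products (the Gram matrix). The helper names `gram_matrix` and `dual_objective`, the toy data, and the hand-picked `alpha` are all assumptions for illustration. No optimizer is run here.

```python
import numpy as np

def gram_matrix(Phi):
    """G[i, j] = phi(x_i) . phi(x_j), with feature vectors stored as rows."""
    return Phi @ Phi.T

def dual_objective(alpha, y, G):
    """sum_i alpha_i - 1/2 sum_ij y_i y_j alpha_i alpha_j G[i, j]."""
    v = alpha * y
    return alpha.sum() - 0.5 * (v @ G @ v)

# Toy data: two points, with phi taken to be the identity (an assumption).
Phi = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.5, 0.5])  # hand-picked; satisfies sum_i alpha_i y_i = 0

G = gram_matrix(Phi)
value = dual_objective(alpha, y, G)
```

Once `G` is available, the feature vectors themselves are never touched again, which is what makes the substitution of a kernel possible.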

Kernel functions

  • Whenever a learning algorithm (such as SVMs) can be written in terms of dot-products, it can be generalized to kernels.

  • A kernel is any function \(K:\mathbb{R}^p\times\mathbb{R}^p\to\mathbb{R}\) that corresponds to a dot product for some feature mapping \(\phi\): \[K(\mathbf{x}_1,\mathbf{x}_2)=\phi(\mathbf{x}_1)\cdot\phi(\mathbf{x}_2).\]

  • Conversely, by choosing a feature mapping \(\phi\), we implicitly choose a kernel function.

  • Recall that \(\phi(\mathbf{x}_1)\cdot \phi(\mathbf{x}_2) = \|\phi(\mathbf{x}_1)\|\,\|\phi(\mathbf{x}_2)\|\cos \angle(\phi(\mathbf{x}_1),\phi(\mathbf{x}_2))\), where \(\angle\) denotes the angle between the vectors, so a kernel function can be thought of as a notion of similarity.

Example: Quadratic kernel

  • Let \(K({\mathbf{x}},{\bf z} )= \left({\mathbf{x}}\cdot {\bf z}\right)^2\).

  • Is this a kernel? \[K({\mathbf{x}},{\bf z}) = \left( \sum_{i=1}^p x_i z_i\right) \left( \sum_{j=1}^p x_j z_j \right) = \sum_{i,j\in\{1\ldots p\}} \left( x_i x_j \right) \left( z_i z_j \right)\]

  • Hence, it is a kernel, with feature mapping: \[\phi({\mathbf{x}}) = \langle x_1^2, ~x_1x_2, ~\ldots, ~x_1x_p, ~x_2x_1, ~x_2^2, ~\ldots, ~x_p^2 \rangle\] The feature vector includes all squares of elements and all cross terms.

  • Note that computing \(\phi\) takes \(O(p^2)\) time, but computing \(K\) takes only \(O(p)\)!
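The identity can be checked numerically. The sketch below builds the explicit \(p^2\)-dimensional feature vector and compares its dot product against the kernel value. The function names `phi_quadratic` and `k_quadratic` are illustrative, not from any library.

```python
import numpy as np

def phi_quadratic(x):
    """Explicit feature map: all p^2 ordered products x_i * x_j."""
    return np.outer(x, x).ravel()

def k_quadratic(x, z):
    """Quadratic kernel: one O(p) dot product, then a square."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0, 3.0])
z = np.array([4.0, 5.0, 6.0])

explicit = phi_quadratic(x) @ phi_quadratic(z)  # O(p^2) work
implicit = k_quadratic(x, z)                    # O(p) work
```

Both routes give the same number (here \((\mathbf{x}\cdot\mathbf{z})^2 = 32^2 = 1024\)), but only the first one ever forms the 9-dimensional feature vectors.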

Polynomial kernels

  • More generally, \(K({\mathbf{x}},{{\mathbf{z}}}) = (1 + {\mathbf{x}}\cdot{{\mathbf{z}}})^d\) is a kernel, for any positive integer \(d\).

  • If we expand the product above, we get terms of all degrees up to and including \(d\) (in \(x_i\) and \(z_i\)).

  • If we use the primal form of the SVM, each of these terms will have a weight associated with it!

  • Curse of dimensionality: it is very expensive both to optimize and to predict with an SVM in primal form with many features.
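To get a feel for how quickly the primal form grows, the sketch below counts the distinct monomials of degree at most \(d\) in \(p\) variables, using the standard combinatorial count \(\binom{p+d}{d}\). The helper name `num_monomials` is an assumption for illustration.

```python
from math import comb

def num_monomials(p, d):
    """Number of distinct monomials of degree <= d in p variables."""
    return comb(p + d, d)

small = num_monomials(2, 2)    # 1, x1, x2, x1^2, x1*x2, x2^2
large = num_monomials(100, 3)  # cubic expansion of 100 input features
```

With \(p=2, d=2\) there are 6 terms, while \(p=100, d=3\) already gives 176851 weights to learn in the primal.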

The "kernel trick"

  • If we work with the dual, we never actually have to compute the features using \(\phi\). We just have to compute the similarity \(K\).

  • That is, we can solve the dual for the \(\alpha_i\):

    max \(\sum_{i=1}^n\alpha_i-\frac{1}{2}\sum_{i,j=1}^n{{y}}_i{{y}}_j\alpha_i\alpha_j K({\mathbf{x}}_i,{\mathbf{x}}_j)\) w.r.t. \(\alpha_i\)
    s.t. \(0\leq\alpha_i\leq C\), \(\sum_{i=1}^n\alpha_i{{y}}_i=0\)
  • The class of a new input \({\mathbf{x}}\) is computed as: \[h_{{{\mathbf{w}}},w_0}({\mathbf{x}}) = \mbox{sign} \left( \sum_{i=1}^n \alpha_i y_i K({\mathbf{x}}_i,{\mathbf{x}}) + w_0 \right)\]

  • Often, \(K(\cdot,\cdot)\) can be evaluated in \(O(p)\) time, a big saving over computing \(\phi\) explicitly!
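The prediction rule above might be sketched as follows. The function `predict`, the toy training set, and the values of `alpha` and `w0` are hypothetical. In particular, the `alpha` values are hand-picked for illustration and were not obtained by solving the dual.

```python
import numpy as np

def poly_kernel(x, z, d=2):
    """Polynomial kernel (1 + x . z)^d."""
    return (1.0 + np.dot(x, z)) ** d

def predict(x, X_train, y_train, alpha, w0, kernel):
    """sign(sum_i alpha_i y_i K(x_i, x) + w0); phi is never computed."""
    k = np.array([kernel(xi, x) for xi in X_train])
    return int(np.sign(np.sum(alpha * y_train * k) + w0))

X_train = np.array([[1.0, 0.0], [-1.0, 0.0]])
y_train = np.array([1.0, -1.0])
alpha = np.array([0.5, 0.5])  # illustrative values, not a solved dual
w0 = 0.0

label_a = predict(np.array([2.0, 1.0]), X_train, y_train, alpha, w0, poly_kernel)
label_b = predict(np.array([-3.0, 0.0]), X_train, y_train, alpha, w0, poly_kernel)
```

Only training points with \(\alpha_i > 0\) contribute to the sum, so in practice the loop would run over the support vectors alone. Note that `np.sign` returns 0 for a point lying exactly on the boundary.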

Some other (fairly generic) kernel functions

  • Polynomial kernel: \(K({\mathbf{x}},{{\mathbf{z}}})=(1+{\mathbf{x}}\cdot{{\mathbf{z}}})^d\). The feature expansion has all monomial terms of degree \(\leq d\).

  • Radial basis/Gaussian kernel (most popular): \[K({\mathbf{x}},{{\mathbf{z}}}) = \exp\left(-\frac{\|{\mathbf{x}}-{{\mathbf{z}}}\|^2}{2\sigma^2}\right)\] The kernel has an infinite-dimensional feature expansion, but dot-products can still be computed in \(O(p)\)!

  • Sigmoid kernel: \[K({\mathbf{x}},{{\mathbf{z}}}) = \tanh (c_1 {\mathbf{x}}\cdot{{\mathbf{z}}}+ c_2)\]
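The three kernels listed above can be written directly from their formulas. This is a plain NumPy sketch. The function names and default parameter values (`d=2`, `sigma=1.0`, `c1=1.0`, `c2=0.0`) are arbitrary choices for illustration.

```python
import numpy as np

def polynomial_kernel(x, z, d=2):
    """(1 + x . z)^d"""
    return (1.0 + np.dot(x, z)) ** d

def gaussian_kernel(x, z, sigma=1.0):
    """exp(-||x - z||^2 / (2 sigma^2))"""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def sigmoid_kernel(x, z, c1=1.0, c2=0.0):
    """tanh(c1 * x . z + c2)"""
    return np.tanh(c1 * np.dot(x, z) + c2)

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

k_poly = polynomial_kernel(x, z)   # (1 + 11)^2
k_same = gaussian_kernel(x, x)     # identical points give 1
k_far = gaussian_kernel(x, z)      # decays with squared distance
k_sig = sigmoid_kernel(np.zeros(2), z)
```

Each function costs one pass over the \(p\) input coordinates, regardless of how large the implicit feature space is.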

Example: Gaussian kernel