2018-10-18

Nonlinearly separable data

  • A linear boundary might be too simple to capture the class structure.

  • One way of getting a nonlinear decision boundary in the input space is to find a linear decision boundary in an expanded space (similar to polynomial regression).

  • Thus, \({{\mathbf{x}_i}}\) is replaced by \(\phi({{\mathbf{x}_i}})\), where \(\phi\) is called a feature mapping.

Separability by adding features

more flexible decision boundary \(\approx\) enriched feature space
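
  • As a minimal sketch of this idea (not from the slides; the quadratic map and the hand-picked weights are illustrative assumptions), data that only a circle can separate in the input space becomes separable by a hyperplane in the expanded space:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.uniform(-2, 2, size=(200, 2))               # 2-D inputs
    y = np.where(X[:, 0]**2 + X[:, 1]**2 > 1, 1, -1)    # +1 outside the unit circle

    def phi(x):
        """Quadratic feature map: (x1, x2) -> (x1^2, sqrt(2)*x1*x2, x2^2)."""
        x1, x2 = x
        return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

    # In the expanded space the classes are separated by the linear boundary
    # w . phi(x) + w0 = 0 with w = (1, 0, 1), w0 = -1, i.e. x1^2 + x2^2 = 1.
    w, w0 = np.array([1.0, 0.0, 1.0]), -1.0
    pred = np.sign(np.array([w @ phi(x) + w0 for x in X]))
    print("accuracy of the linear boundary in feature space:", np.mean(pred == y))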

Margin optimization in feature space

  • Replacing \({{\mathbf{x}_i}}\) with \(\phi({{\mathbf{x}_i}})\), the dual form becomes:
max \(\sum_{i=1}^n\alpha_i-\frac{1}{2}\sum_{i,j=1}^n{{y}}_i{{y}}_j\alpha_i\alpha_j({\phi({{\mathbf{x}_i}})\cdot\phi({{\mathbf{x}_j}})})\)
w.r.t. \(\alpha_i\)
s.t. \(0\leq\alpha_i\leq C\) and \(\sum_{i=1}^n\alpha_i{{y}}_i=0\)
  • Classification of an input \(\mathbf{x}\) is given by: \[h_{{{\mathbf{w}}},w_0}({{\mathbf{x}}}) = \mbox{sign}\left(\sum_{i=1}^n\alpha_i{{y}}_i({\phi({{\mathbf{x}_i}})\cdot\phi({{\mathbf{x}}})})+w_0\right)\]

  • Note that in the dual form, to do both SVM training and prediction, we only ever need to compute dot-products of feature vectors.
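
  • A minimal sketch of this observation (here \(\alpha\) and \(w_0\) are assumed to come from any solver for the dual above), showing that prediction touches the training data only through the dot products \(\phi(\mathbf{x}_i)\cdot\phi(\mathbf{x})\):

    import numpy as np

    def dual_predict(X_train, y_train, alpha, w0, x_new, phi):
        """sign( sum_i alpha_i y_i (phi(x_i) . phi(x_new)) + w0 )."""
        dots = np.array([phi(x_i) @ phi(x_new) for x_i in X_train])
        return np.sign(alpha @ (y_train * dots) + w0)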

Kernel functions

  • Whenever a learning algorithm (such as SVMs) can be written in terms of dot-products, it can be generalized to kernels.

  • A kernel is any function \(K:\mathbb{R}^p\times\mathbb{R}^p\to\mathbb{R}\) which corresponds to a dot product for some feature mapping \(\phi\): \[K({{\mathbf{x}}}_1,{{\mathbf{x}}}_2)=\phi({{\mathbf{x}}}_1)\cdot\phi({{\mathbf{x}}}_2) \text{ for some }\phi.\]

  • Conversely, by choosing a feature mapping \(\phi\), we implicitly choose a kernel function.

  • Recall that \(\phi({{\mathbf{x}}}_1)\cdot \phi({{\mathbf{x}}}_2) = \|\phi({{\mathbf{x}}}_1)\|\,\|\phi({{\mathbf{x}}}_2)\|\cos \angle(\phi({{\mathbf{x}}}_1),\phi({{\mathbf{x}}}_2))\), where \(\angle\) denotes the angle between the vectors, so a kernel function can be thought of as a notion of similarity between inputs.

Example: Quadratic kernel

  • Let \(K(\mathbf{x},{\bf z} )= \left(\mathbf{x}\cdot {\bf z}\right)^2\).

  • Is this a kernel? \[K(\mathbf{x},{\bf z}) = \left( \sum_{i=1}^p x_i z_i\right) \left( \sum_{j=1}^p x_j z_j \right) = \sum_{i,j\in\{1\ldots p\}} \left( x_i x_j \right) \left( z_i z_j \right)\]

  • Hence, it is a kernel, with feature mapping: \[\phi(\mathbf{x}) = \langle x_1^2, ~x_1x_2, ~\ldots, ~x_1x_p, ~x_2x_1, ~x_2^2, ~\ldots, ~x_p^2 \rangle\] Feature vector includes all squares of elements and all cross terms.

  • Note that computing \(\phi\) takes \(O(p^2)\) but computing \(K\) takes only \(O(p)\)!
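
  • A quick numerical check of the identity above (a sketch; phi_quad lists all \(p^2\) products \(x_ix_j\), matching the feature vector written out on this slide):

    import numpy as np

    def phi_quad(x):
        """All p^2 products x_i * x_j: squares and cross terms."""
        return np.outer(x, x).ravel()              # O(p^2) features

    rng = np.random.default_rng(1)
    x, z = rng.standard_normal(5), rng.standard_normal(5)

    K = (x @ z) ** 2                               # O(p) work
    explicit = phi_quad(x) @ phi_quad(z)           # O(p^2) work
    print(np.isclose(K, explicit))                 # True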

Polynomial kernels

  • More generally, \(K(\mathbf{x},{{\mathbf{z}}}) = (1 + \mathbf{x}\cdot{{\mathbf{z}}})^d\) is a kernel, for any positive integer \(d\).

  • If we expand the product above, we get terms of all degrees up to and including \(d\) (in the \(x_i\) and \(z_i\)).

  • If we use the primal form of the SVM, each of these will have a weight associated with it!

  • Curse of dimensionality: it is very expensive both to optimize and to predict with an SVM in primal form with many features.
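
  • To make the last point concrete (a small sketch; \(\binom{p+d}{d}\) counts the monomials of degree at most \(d\) in \(p\) variables, i.e. the size of the explicit feature expansion):

    from math import comb

    for p, d in [(10, 2), (100, 3), (1000, 5)]:
        n_features = comb(p + d, d)   # monomials of degree <= d in p variables
        print(f"p={p:4d}, d={d}: {n_features:,} primal features vs. one O(p) kernel evaluation")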

The “kernel trick”

  • If we work with the dual, we never actually have to compute the features using \(\phi\); we only have to compute the similarity \(K\).

  • That is, we can solve the dual for the \(\alpha_i\):

    max \(\sum_{i=1}^n\alpha_i-\frac{1}{2}\sum_{i,j=1}^n{{y}}_i{{y}}_j\alpha_i\alpha_j K(\mathbf{x}_i,\mathbf{x}_j)\) w.r.t. \(\alpha_i\)
    s.t. \(0\leq\alpha_i\leq C\), \(\sum_{i=1}^n\alpha_i{{y}}_i=0\)
  • The class of a new input \(\mathbf{x}\) is computed as: \[h_{{{\mathbf{w}}},w_0}(\mathbf{x}) = \mbox{sign} \left( \sum_{i=1}^n \alpha_i y_i K(\mathbf{x}_i,\mathbf{x}) + w_0 \right)\]

  • Often, \(K(\cdot,\cdot)\) can be evaluated in \(O(p)\) time—a big savings!
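
  • A minimal sketch of the kernel trick in practice (assuming scikit-learn is available; SVC with kernel="precomputed" accepts the Gram matrix directly), in which the learner only ever sees kernel values, never \(\phi(\mathbf{x})\):

    import numpy as np
    from sklearn.svm import SVC

    def quad_kernel(A, B):
        """K(x, z) = (x . z)^2 for every pair of rows of A and B."""
        return (A @ B.T) ** 2

    rng = np.random.default_rng(0)
    X = rng.uniform(-2, 2, size=(200, 2))
    y = np.where(X[:, 0]**2 + X[:, 1]**2 > 1, 1, -1)

    clf = SVC(C=1.0, kernel="precomputed")
    clf.fit(quad_kernel(X, X), y)                  # training needs only K(x_i, x_j)
    X_new = rng.uniform(-2, 2, size=(5, 2))
    print(clf.predict(quad_kernel(X_new, X)))      # prediction needs only K(x_new, x_i)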

Some other (fairly generic) kernel functions

  • \(K(\mathbf{x},{{\mathbf{z}}})=(1+\mathbf{x}\cdot{{\mathbf{z}}})^d\) – feature expansion has all monomial terms of degree \(\leq d\).

  • Radial Basis Function (RBF)/Gaussian kernel (most popular): \[K(\mathbf{x},{{\mathbf{z}}}) = \exp(-\gamma \|\mathbf{x}-{{\mathbf{z}}}\|^2)\] The kernel has an infinite-dimensional feature expansion, but the feature-space dot product (the kernel value) can still be computed in \(O(p)\)!

  • Sigmoid kernel: \[K(\mathbf{x},{{\mathbf{z}}}) = \tanh (c_1 \mathbf{x}\cdot{{\mathbf{z}}}+ c_2)\]
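
  • Written out as numpy functions (a sketch; \(d\), \(\gamma\), \(c_1\), \(c_2\) are hyperparameters, typically chosen by cross-validation):

    import numpy as np

    def polynomial_kernel(x, z, d=3):
        # (1 + x . z)^d : all monomials of degree <= d
        return (1.0 + x @ z) ** d

    def rbf_kernel(x, z, gamma=1.0):
        # exp(-gamma * ||x - z||^2)
        return np.exp(-gamma * np.sum((x - z) ** 2))

    def sigmoid_kernel(x, z, c1=1.0, c2=0.0):
        # tanh(c1 * x . z + c2)
        return np.tanh(c1 * (x @ z) + c2)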

Example: Radial Basis Function (RBF) / Gaussian kernel
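
  • A short sketch of this example (assuming scikit-learn; the circle data and the grid of \(\gamma\) values are illustrative choices): fitting an RBF-kernel SVM and varying \(\gamma\) shows how it controls the flexibility of the decision boundary.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.uniform(-2, 2, size=(300, 2))
    y = np.where(X[:, 0]**2 + X[:, 1]**2 > 1, 1, -1)

    for gamma in (0.1, 1.0, 10.0):
        clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)
        print(f"gamma={gamma:5.1f}: training accuracy = {clf.score(X, y):.2f}")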