---
title: "Linear Models for Classification"
author: "Dan Lizotte"
date: '`r Sys.Date()`'
output:
  beamer_presentation: default
  html_document: default
  ioslides_presentation:
    css: ../my_ioslides.css
  pdf_document: null
---
```{r echo=F,message=F,warning=F}
library(dplyr); library(ggplot2)
# Column names for the Wisconsin Prognostic Breast Cancer (WPBC) data
wpbc_featurenames = c("Radius","Texture","Perimeter","Area","Smoothness","Compactness","Concavity","Concave Points","Symmetry","Fractal Dim")
wpbc_names = c("ID","Outcome","Time",paste(wpbc_featurenames,"Mean"),paste(wpbc_featurenames,"SE"),paste(wpbc_featurenames,"Worst"),"Tumor Diameter","Lymph Status")
bc <- read.csv("~dlizotte/Seafile/My Library/Teaching/cs4437/Lectures/6_Linear Models/data/wpbc.data.txt",header=FALSE,col.names=wpbc_names)
```
\newcommand{\T}{\mathsf{T}}
\newcommand{\x}{\mathbf{x}}
\newcommand{\X}{\mathbf{X}}
\newcommand{\w}{\mathbf{w}}
## Linear models in general HTF Ch. 2.8.3
- By *linear models*, we mean that the hypothesis function
$h_{\w}(\x)$ is a *linear function of the parameters $\w$*.
- Predictions are a *linear combination of
feature values*
- $$h_{\w}(\x) = \sum_{k=0}^{p} w_k \phi_k(\x) = {\boldsymbol{\phi}}(\x)^\T \w$$
where the $\phi_k$ are called *basis functions*. As usual, we let
$\phi_0(\x)=1$ for all $\x$ to create a bias term.
- To recover degree-$d$ polynomial regression in one variable, set
$$\phi_0(x) = 1,\ \phi_1(x) = x,\ \phi_2(x) = x^2,\ \ldots,\ \phi_d(x) = x^d$$
- Basis functions are *fixed* before training (see the sketch below)
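
A minimal sketch of this for $d=3$: the fit is non-linear in $x$ but linear in $\w$, so once the basis is fixed, ordinary least squares applies.

```{r}
# Sketch: degree-3 polynomial regression as a linear model.
# The basis functions phi_k are fixed up front; only w is learned.
set.seed(1)
x <- runif(50); y <- sin(2*pi*x) + rnorm(50, sd = 0.1)
Phi <- cbind(1, x, x^2, x^3)               # phi_0(x), ..., phi_3(x)
w <- solve(t(Phi) %*% Phi, t(Phi) %*% y)   # least-squares weights
```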
## Linear Methods for Classification
- Classification tasks
- Error functions for classification
- Logistic Regression
- Support Vector Machines
## Example: Given nucleus radius, predict cancer recurrence {.smaller}
```{r}
ggplot(bc,aes(Radius.Mean,fill=Outcome,color=Outcome)) + geom_density(alpha=I(1/2))
```
## Example: Solution by linear regression
- Univariate real input: nucleus size
- Output coding: non-recurrence = 0, recurrence = 1
- **Sum squared error** minimized by the blue line
```{r echo=F}
bc <- bc %>% mutate(binOutcome = as.numeric(Outcome == "R"))
ggplot(bc,aes(x=Radius.Mean,y=binOutcome)) + geom_point(aes(colour=Outcome)) + geom_smooth(method="lm",se=F)
```
## Linear regression for classification
- The predictor shows an increasing trend towards recurrence with
larger nucleus size, as expected.
- Output *cannot be directly interpreted* as a
class prediction.
- Thresholding output (e.g., at 0.5) could be used to predict 0 or 1.\
(In this case, prediction would be 0 except for extremely large
nucleus size.)
- Interpret as a probability? No: the output is not bounded to $[0,1]$,
and it is not consistent even for well-separated data
## Probabilistic view
- Suppose we have two possible classes: $y\in \{0,1\}$.
- The symbols “$0$” and “$1$” are unimportant. Could have been
$\{a,b\}$, $\{\mathit{up},\mathit{down}\}$, whatever.
- Rather than try to predict the class label directly, ask:\
What is the *probability* that a given input
$\x$ has class $y=1$?
- Conditional Probability:
$$P(y=1|\X = \x) = \frac{P(\X = \x, y=1)}{P(\X = \x)} $$
- Bayes' Rule
$$ = \frac{P(\X = \x| y=1)P(y=1)}{P(\X = \x|y=1)P(y=1)+P(\X = \x|y=0)P(y=0)} $$
## Probabilistic models for binary classification {.smaller}
- Can also write:
$$P(y=1|\X = \x)=\sigma\left(\log\frac{P(y=1|\X = \x)}{P(y=0|\X = \x)}\right) = \sigma\left(\log\frac{P(\X = \x|y=1)P(y=1)}{P(\X = \x|y=0)P(y=0)}\right)$$
where $\sigma(a) = \frac{1}{1+\exp(-a)}$, the *sigmoid*
or *logistic function*.
- Discriminative Learning:
- Model $\log\frac{P(y=1|\X = \x)}{P(y=0|\X = \x)}$
(*log odds*) as a function of
$\mathbf{x}$
- Only models how to *discriminate* between examples of the
two classes. Does not model distribution of
$\mathbf{x}$.
- Generative Learning:
- Model $P(y=1), P(y=0), P(\X = \x|y=1), P(\X = \x|y=0)$,
then use rightmost formula above
- Models the full joint; can actually use the model to *generate* (i.e.
fantasize) data
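
A quick numerical check of the identity above (a sketch; `sigmoid` is defined by hand here, not a library function):

```{r}
# Sketch: the sigmoid inverts the log-odds, recovering the probability.
sigmoid <- function(a) 1 / (1 + exp(-a))
p <- 0.8
sigmoid(log(p / (1 - p)))   # log-odds of 0.8, mapped back: 0.8
```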
## Logistic regression HTF (Ch. 4.4)
- Represent the hypothesis as a logistic function of a linear
combination of inputs:
$$h(\x) = \sigma(\x^\T \w)$$
- Interpret $h(\x)$ as
$P(y=1|\X = \x)$, interpret
$\x^\T \w$ as the
log-odds
- How do we choose ${\bf w}$?
- In the probabilistic framework, observing
$\langle \x_i , 1 \rangle$
does not mean $h(\x_i)$ should be as close to $1$ as possible.
- **Maximize probability the model assigns to the $y_i$,
*given the* $\x_i$**.
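
A sketch of this interpretation (assuming the `bc` data and `binOutcome` column from earlier chunks):

```{r}
# Sketch: h(x) = sigma(x^T w), computed by hand, agrees with predict().
bclr <- glm(binOutcome ~ Radius.Mean, data = bc, family = "binomial")
w <- coef(bclr)                              # (w_0, w_1)
1 / (1 + exp(-(w[1] + w[2] * 20)))           # h_w(x) at radius 20, by hand
predict(bclr, data.frame(Radius.Mean = 20), type = "response")
```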
## Max Conditional Likelihood {.smaller}
- **Maximize probability the model assigns to the $y_i$,
*given the* $\x_i$**.
- Assumption 1: Examples are i.i.d. Probability of
observing all $y$s is product $$\begin{gathered}
P(Y_1=y_1, Y_2=y_2, ..., Y_n = y_n|X_1 = \x_1, X_2 = \x_2, ..., X_n =
\x_n) \\
= \prod_{i=1}^n P(Y_i = y_i | X_i = \x_i)\end{gathered}$$
- Assumption 2: $$\begin{aligned}
P(y = 1|\X = \x) & = h_\w(\x) = 1 / (1
+ \exp(-\x^\T \w))\\
P(y = 0|\X = \x) & = (1 - h_\w(\x))\\\end{aligned}$$
## Max Conditional Likelihood {.smaller}
- **Maximize probability the model assigns to the $y_i$,
*given the* $\x_i$**.
- More stable to maximize log probability. Note
$$\begin{aligned}
\log P(Y_i = y_i | X_i = \x_i) & =
\left\{
\begin{array}{ll}
\log h_\w(\x_i) & \mbox{if}~y_i=1 \\
\log(1-h_\w(\x_i)) & \mbox{if}~y_i=0
\end{array}
\right.
\end{aligned}
$$
- Therefore,
$$\log \prod_{i=1}^n P(Y_i = y_i | X_i = \x_i) =
\sum_{i = 1}^n \left[y_i
\log( h_\w(\x_i)) + (1 - y_i) \log (1 - h_\w(\x_i))\right]
$$
- Suggests an error $$\begin{aligned}
\hspace{-2em} J(h_{\w}) = - \sum_{i = 1}^n \left[y_i
\log( h_\w(\x_i)) + (1 - y_i) \log (1 - h_\w(\x_i))\right]\end{aligned}$$
- This is the *cross entropy*: the number of bits (using base-2
logarithms) needed to transmit $y_i$ when both parties know $h_\w$ and
$\x_i$.
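
A numerical sketch (again assuming the `bc` data from earlier): the cross-entropy error of a fitted model is exactly its negative log-likelihood.

```{r}
# Sketch: cross-entropy error J(h_w) equals the negative log-likelihood.
bclr <- glm(binOutcome ~ Radius.Mean, data = bc, family = "binomial")
p <- predict(bclr, type = "response")          # h_w(x_i) for each example
y <- bc$binOutcome
c(J = -sum(y * log(p) + (1 - y) * log(1 - p)), # cross-entropy by hand
  negLogLik = -as.numeric(logLik(bclr)))       # should agree
```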
## Back to the breast cancer problem
```{r echo=F}
ggplot(bc,aes(x=Radius.Mean,y=binOutcome)) + geom_point(aes(colour=Outcome)) +
stat_smooth(method="glm",method.args=list(family="binomial"),se=F,colour="blue") +
stat_smooth(method="lm",se=F,linetype=6,colour="red")
```
Logistic Regression:
```{r echo=F}
bclr <- glm(Outcome ~ Radius.Mean, data=bc, family="binomial")
bclr$coefficients
```
Least Squares:
```{r echo=F}
bclm <- lm(binOutcome ~ Radius.Mean, data=bc)
bclm$coefficients
```
## Supervised Learning Methods: “Objective-driven”
--------------------------------------------------------------------------------------------------------------------------------------------
Method   Form                                               Objective
-------- -------------------------------------------------- --------------------------------------------------------------------------------
OLS      $h_\w(\x) = \x^\T \w$                              $\sum_{i=1}^n (h_\w(\x_i) - y_i)^2$
         $\approx E[Y \mid \mathbf{X}=\x]$...               ...using a linear function

LR       $h_\w(\x) = \frac{1}{1 + \mathrm{e}^{-\x^\T \w}}$  $-\sum_{i=1}^n \left[y_i \log h_\w(\x_i) + (1-y_i) \log (1-h_\w(\x_i))\right]$
         $\approx P(Y=y|\mathbf{X}=\x)$...                  ...using sigmoid of a linear function
--------------------------------------------------------------------------------------------------------------------------------------------

- Both model the ***conditional mean of $y$*** using a (transformed) ***linear function***
- Both use ***maximum conditional likelihood*** to estimate
## Decision boundary HTF Ch. 2.3.1,2.3.2
- How complicated is a classifier?
- One way to think about it is in terms of its *decision
boundary*, i.e. the line it defines for separating examples
- *Linear classifiers* draw a hyperplane between examples
of the different classes. *Non-linear classifiers*
draw more complicated surfaces between the different classes.
- For a probabilistic classifier with a cutoff of 0.5,\
the decision boundary is the curve on which:
$$\frac{P(y=1|\X = \x)}{P(y=0|\X = \x)} = 1, \mbox{i.e., where } \log\frac{P(y=1|\X = \x)}{P(y=0|\X = \x)} = 0$$
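
Concretely, for univariate logistic regression with cutoff $c$, the boundary is where the linear part equals the log-odds of the cutoff:
$$w_0 + w_1 x = \log\frac{c}{1-c} \quad\Rightarrow\quad x^* = \frac{\mathrm{logit}(c) - w_0}{w_1}$$
For $c=0.5$, $\mathrm{logit}(c) = 0$ and $x^* = -w_0/w_1$; the vertical lines in the plots that follow are computed this way.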
\renewcommand{\Pr}{\mathrm{Pr}}
## Decision boundary
Class = R if $\Pr(Y=1|X=x) > 0.5$
```{r echo=F}
bc <- bc %>% mutate(binOutcome = as.numeric(Outcome == "R"))
bclr <- glm(Outcome ~ Radius.Mean, data=bc, family="binomial")
db <- -bclr$coefficients[1]/bclr$coefficients[2]
ggplot(bc,aes(x=Radius.Mean,y=binOutcome)) + geom_point(aes(colour=Outcome)) + stat_smooth(method="glm",method.args=list(family="binomial"),se=F) +
stat_smooth(method="lm",se=F,linetype=6,color="red") + geom_vline(xintercept=db,color="magenta")
```
## Decision boundary
Class = R if $\Pr(Y=1|X=x) > 0.25$
```{r echo=F,warning=F,message=F}
library(boot)
bc <- bc %>% mutate(binOutcome = as.numeric(Outcome == "R"))
bclr <- glm(Outcome ~ Radius.Mean, data=bc, family="binomial")
db <- (logit(0.25) - bclr$coefficients[1])/bclr$coefficients[2]
ggplot(bc,aes(x=Radius.Mean,y=binOutcome)) + geom_point(aes(colour=Outcome)) + stat_smooth(method="glm",formula=y~x,method.args=list(family="binomial"),se=F) +
stat_smooth(method="lm",se=F,linetype=6,color="red") + geom_vline(xintercept=db,color="magenta")
```
## Decision boundary
Class = R if $\Pr(Y=1|X=x) > 0.5$
```{r echo=F,warning=F,message=F}
library(boot)
bc <- bc %>% mutate(binOutcome = as.numeric(Outcome == "R"))
bclr <- glm(Outcome ~ Radius.Mean + Compactness.Mean, data=bc, family="binomial")
int <- (logit(0.5) - bclr$coefficients[1])/bclr$coefficients[3]
slp <- -bclr$coefficients[2]/bclr$coefficients[3]
ggplot(bc,aes(x=Radius.Mean,y=Compactness.Mean,color=Outcome)) + geom_point(size=3) + geom_abline(intercept=int,slope=slp,color="magenta")
```
## Decision boundary
Class = R if $\Pr(Y=1|X=x) > 0.25$
```{r echo=F,warning=F,message=F}
library(boot)
bc <- bc %>% mutate(binOutcome = as.numeric(Outcome == "R"))
bclr <- glm(Outcome ~ Radius.Mean + Compactness.Mean, data=bc, family="binomial")
int <- (logit(0.25) - bclr$coefficients[1])/bclr$coefficients[3]
slp <- -bclr$coefficients[2]/bclr$coefficients[3]
ggplot(bc,aes(x=Radius.Mean,y=Compactness.Mean,color=Outcome)) + geom_point(size=3) + geom_abline(intercept=int,slope=slp,color="magenta")
```
## Supervised Learning Methods: “Objective-driven”
--------------------------------------------------------------------------------------------------------------------------------------------
Method   Form                                               Objective
-------- -------------------------------------------------- --------------------------------------------------------------------------------
OLS      $h_\w(\x) = \x^\T \w$                              $\sum_{i=1}^n (h_\w(\x_i) - y_i)^2$
         $\approx E[Y \mid \mathbf{X}=\x]$...               ...using a linear function

LR       $h_\w(\x) = \frac{1}{1 + \mathrm{e}^{-\x^\T \w}}$  $-\sum_{i=1}^n \left[y_i \log h_\w(\x_i) + (1-y_i) \log (1-h_\w(\x_i))\right]$
         $\approx P(Y=y|\mathbf{X}=\x)$...                  ...using sigmoid of a linear function

SVM      $h_\w(\x) = \mathrm{sgn}(\x^\T \w)$
         ...using a linear function
--------------------------------------------------------------------------------------------------------------------------------------------

## Large Margin Classifiers: Linear Support Vector Machines
- Linear classifiers that focus on learning the *decision
boundary* rather than the conditional distribution
$P(Y=y|\mathbf{X}=\x)$
- Perceptrons
- Definition
- Perceptron learning rule
- Convergence
- “Margin” idea and max margin classifiers
- (Linear) support vector machines
- Formulation as optimization problem
## Marvin Minsky, 1927-2016
## Perceptrons HTF Ch. 4.5
- Consider a binary classification problem with data
$\{{{\x_i}},y_i\}_{i=1}^n$,
$y_i\in\{-1,+1\}$. **Note coding of $y_i$**.
- A *perceptron* (Rosenblatt, 1957) is a classifier of
the form:
$$h_{{{\mathbf{w}}},w_0}({{\x}}) = \mbox{sign}(\x^\T \w+ w_0) = \left\{ \begin{array}{ll}
+1 & \mathrm{if}~ \x^\T \w+ w_0\geq 0 \\
-1 & \mathrm{otherwise}
\end{array} \right.$$
Here, ${{\mathbf{w}}}$ is a
vector of weights, and $w_0$ is a
constant offset. (**Note $x_0 = 1$ is omitted.**)
- The decision boundary is
$\x^\T \w+ w_0= 0$.
- Perceptrons output a **class**, not a probability
- An example $( {{\x}}, y )$ is
classified correctly if:
$$y \cdot (\x^\T \w+ w_0) > 0$$
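
A minimal sketch in R (the function name is illustrative, not from a library):

```{r}
# Sketch: perceptron prediction and the "correctly classified" test.
perceptron_predict <- function(x, w, w0) ifelse(sum(x * w) + w0 >= 0, 1, -1)
x <- c(0.2, 0.7); y <- 1; w <- c(1, 1); w0 <- -0.5
perceptron_predict(x, w, w0)    # predicted class in {-1, +1}
y * (sum(x * w) + w0) > 0       # TRUE iff (x, y) is classified correctly
```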
## Linear separability
- The data set is *linearly separable* if and only if
there exists ${{\mathbf{w}}}$, $w_0$ such that:
- For all $i$,
$y_i(\x_i^\T \w +w_0)>0$.
- Or equivalently, the 0-1 loss $\sum_i
\mathbf{1}_{y_i(\x_i^\T \w +w_0) \leq 0}$
is zero for some set of parameters $({\bf w}, w_0)$.
## Linear Separability
## The Perceptron Learning Rule
- Consider the following procedure:
1. Initialize ${{\mathbf{w}}}$ and $w_0$ randomly
2. While any training examples remain incorrectly classified
1. Loop through all misclassified examples
2. For misclassified example $i$, perform the updates:
$${{\mathbf{w}}}\gets {{\mathbf{w}}}+ \delta y_i{{\x}}_i,~~~~~w_0\gets w_0 + \delta y_i$$
where $\delta$ is a step-size parameter.
- The update equation, or sometimes the whole procedure, is called the
*perceptron learning rule*.
- Intuition: For positive examples misclassified as negative, change
${{\mathbf{w}}}$ to increase
$\x_i^\T \w +w_0$,
and vice versa
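
The core update in isolation, as a sketch (a full learning loop appears in the example slides below):

```{r}
# Sketch: one perceptron update on a misclassified example (x, y).
delta <- 1                                         # step size
x <- c(0.2, 0.7); y <- 1; w <- c(-1, -1); w0 <- 0  # y*(x.w + w0) <= 0: mistake
w <- w + delta * y * x                             # push boundary toward correctness
w0 <- w0 + delta * y
y * (sum(w * x) + w0)                              # grew by delta*(||x||^2 + 1)
```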
## Error Minimization Interpretation
- The perceptron learning rule can be interpreted as gradient
descent on the following function: $${{J}}({{\mathbf{w}}},w_0) = \sum_{i=1}^n \left\{ \begin{array}{ll}
0 & \mathrm{if}~ y_i(\x_i^\T \w + w_0)\geq 0 \\
-y_i(\x_i^\T \w + w_0) & \mathrm{if}~ y_i(\x_i^\T \w + w_0)<0
\end{array}\right.$$
- For correctly classified examples, the error is zero.
- For incorrectly classified examples, the error is by how much
$\x_i^\T \w +w_0$
is on the wrong side of the decision boundary.
- $J$ is piecewise linear, so it has a
gradient almost everywhere; stochastic gradient descent gives the
perceptron learning rule.
- $J$ is zero if and only if all examples are classified correctly –
just like the 0-1 loss function.
## Perceptron convergence theorem
- ***If*** the classes are linearly separable, ***then***
the perceptron learning rule will find a separating hyperplane after some finite number of updates.
- The number of updates depends on the data set, and also on the step
size parameter.
- If the classes are not linearly separable, there will be oscillation
(which can be detected automatically).
```{r results='asis',fig.height=6,echo=F}
# Perceptron learning on a small linearly separable data set;
# we record a plot before and after every update.
n <- 20
set.seed(1)
train <- data.frame(x1 = runif(n), x2 = runif(n))
train$y <- (train$x1 + train$x2 > 1)*2 - 1  # labels in {-1,+1}
w <- c(1,1)         # initial weights
w0 <- 1             # initial offset
slp <- -w[1]/w[2]   # slope and intercept of the boundary, for plotting
int <- -w0/w[2]
mistakes = T
plts <- list()
pl <- 1
while(mistakes) {
  mistakes = F
  for (i in 1:n) {
    xv <- as.numeric(train[i,c(1,2)])
    y <- as.numeric(train[i,'y'])
    if ((sum(w*xv) + w0)*y <= 0) {  # example i is misclassified
      # Record a plot of the boundary before the update...
      train$yhat <- factor(sign((as.matrix(train[,c("x1","x2")]) %*% w) + w0), levels = c(1,-1))
      .e <- environment()
      plts[[pl]] <- ggplot(train,aes(x=x1,y=x2,color=factor(y, levels = c(1,-1)),shape=yhat),environment=.e) + annotate("point",x=xv[1],y=xv[2],size=3,color="red") + geom_point() + geom_abline(slope=slp,intercept=int) + ggtitle(sprintf("Update: %.1f w: [%.3f,%.3f], w0: %.3f",pl/2,w[1],w[2],w0)) + xlim(-0.5,1.5) + ylim(-0.5,1.5) + scale_shape_discrete(drop = FALSE)
      pl <- pl + 1
      # ...apply the perceptron update (step size delta = 1)...
      w <- w + y*xv
      w0 <- w0 + y
      mistakes = T
      slp <- -w[1]/w[2]
      int <- -w0/w[2]
      # ...and record a plot of the boundary after the update.
      train$yhat <- factor(sign((as.matrix(train[,c("x1","x2")]) %*% w) + w0), levels = c(1,-1))
      .e <- environment()
      plts[[pl]] <- ggplot(train,aes(x=x1,y=x2,color=factor(y, levels=c(1,-1)),shape=yhat),environment=.e) + annotate("point",x=xv[1],y=xv[2],size=3,color="red") + geom_point() + geom_abline(slope=slp,intercept=int) + ggtitle(sprintf("Update: %.1f w: [%.3f,%.3f], w0: %.3f",pl/2,w[1],w[2],w0)) + xlim(-0.5,1.5) + ylim(-0.5,1.5) + scale_shape_discrete(drop = FALSE)
      pl <- pl + 1
    }
  }
}
# Emit one slide per recorded plot
for ( pl in 1:length(plts) ) {
writeLines("## Perceptron Learning Example\n\n")
plot(plts[[pl]])
writeLines("\n\n")
}
```
## Weight as a combination of input vectors
- Recall the perceptron learning rule:
$${{\mathbf{w}}}\gets {{\mathbf{w}}}+ \delta y_i{{\x}}_i,~~~~~w_0\gets w_0 + \delta y_i$$
- If initial weights are zero, then at any step, the *weights
are a linear combination of feature vectors of the examples*:
$${{\mathbf{w}}}= \sum_{i=1}^n \alpha_i y_i {{\x_i}},~~~~~w_0 =\sum_{i=1}^n \alpha_i y_i$$
where $\alpha_i$ is the sum of step sizes used for all updates based
on example $i$.
- This is called the *dual representation* of the classifier.
- Even at the end of training, some examples may never have
participated in an update (they were always correct), so the
corresponding $\alpha_i=0$.
## Examples used (bold) and not used (faint) in updates {.smaller}
## Comment: Solutions are non-unique
## Perceptron summary
- Perceptrons can be learned to fit linearly separable data, using a
gradient descent rule.
- Blindingly fast
- Solutions are non-unique
## Support Vector Machines
- Support vector machines (SVMs) for binary classification can be
viewed as a way of training perceptrons
- Three main new ideas:
- An optimization criterion (the "margin") guarantees uniqueness and has theoretical
advantages
- Natural handling of nonseparable data by allowing mistakes
- An efficient way of operating in expanded feature
spaces: "kernel trick"
- SVMs can also be used for multiclass classification and regression.
## Returning to the non-uniqueness issue
- Consider a linearly separable binary classification data set
- There is an infinite number of hyperplanes that separate the
classes:
- Which plane is best?
- For a given plane, for which points should we be most
confident in the classification?
## The margin, and linear SVMs
- For a given separating hyperplane, the *margin* is two
times the (Euclidean) distance from the hyperplane to the nearest
training example.
- Width of the "strip" around the decision boundary
containing no training examples.
- A linear SVM is a perceptron for which we choose
${{\mathbf{w}}},w_0$ so that the margin is maximized
## Distance to the decision boundary
- Suppose we have a decision boundary that separates the data.
- Let $\gamma_i$ be the distance from instance
${{\x_i}}$ to the
decision boundary.
- How can we write $\gamma_i$ in terms of
${{\x_i}}, y_i, {{\mathbf{w}}}, w_0$?
## Distance to the decision boundary (II)
- ${{\mathbf{w}}}$ is normal to the
decision boundary. Thus,
$\frac\w{||{{\mathbf{w}}}||}$ is
the unit normal of the boundary.
- Vector from B to $\x_i$ is
$\gamma_i \frac\w{||{{\mathbf{w}}}||}$.
- B, the point on the boundary nearest
${{\x_i}}$, is
${{\x_i}}-\gamma_i \frac\w{||{{\mathbf{w}}}||}$.
- Since B is on the boundary,
$$\left({{\x_i}}-\gamma_i \frac\w{||{{\mathbf{w}}}||}\right)^\T \w + w_0 = 0$$
- Solving for $\gamma_i$ yields
$$\gamma_i = \frac{\x_i^\T \w + w_0}{||\w||}$$
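
A numerical sketch of the formula:

```{r}
# Sketch: signed distance gamma_i from each point to the hyperplane.
w <- c(1, 1); w0 <- -1                     # boundary: x1 + x2 - 1 = 0
X <- rbind(c(0.9, 0.8), c(0.1, 0.2))
as.numeric(X %*% w + w0) / sqrt(sum(w^2))  # positive side, negative side
```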
## The margin HTF Ch. 4.5, Ch 12
- The *margin of the hyperplane* is $2M$, where
$M=\min_i y_i \gamma_i$
- The most direct statement of the problem of finding a maximum margin
separating hyperplane is thus
$$\max_{\w ,w_0} \min_i y_i \gamma_i \equiv \max_{\w ,w_0} \min_i y_i\frac{\x_i^\T \w + w_0}{||\w||}$$
- This turns out to be inconvenient for optimization, however
## Treating the $\gamma_i$ as constraints
- From the definition of margin, we have:
$$M \leq y_i \gamma_i = y_i \frac{\x_i^\T \w + w_0}{||\w||} ~~~~\forall i$$
- This suggests:\
----------- ----------------------------------------------------------------------
maximize    $M$ with respect to $M, {{\mathbf{w}}}, w_0$
subject to  $M \leq y_i \frac{\x_i^\T \w + w_0}{||\w||}$ for all $i$
----------- ----------------------------------------------------------------------

- Problems:
- ${{\mathbf{w}}}$ appears nonlinearly in
the constraints.
- This problem is underconstrained. If
$({{\mathbf{w}}},w_0,M)$ is an optimal solution, then
so is $(\beta{{\mathbf{w}}},\beta w_0,M)$ for any
$\beta>0$.
## Adding a constraint
Let’s add the constraint that $M = 1 / \|{{\mathbf{w}}}\|$:
- This allows us to rewrite the objective function:
- This is really nice because the constraints are linear.
----------- ----------------------------------------------------------------------
maximize    $\frac{1}{||\w||}$ with respect to ${{\mathbf{w}}}, w_0$
subject to  $\frac{1}{||\w||} \leq y_i \frac{\x_i^\T \w + w_0}{||\w||}$ for all $i$
----------- ----------------------------------------------------------------------

which is the same as

----------- ----------------------------------------------------------------------
maximize    $\frac{1}{||\w||}$ with respect to ${{\mathbf{w}}}, w_0$
subject to  $1 \le y_i (\x_i^\T \w + w_0)$ for all $i$
----------- ----------------------------------------------------------------------

## Final formulation
- Let’s minimize $\|{{\mathbf{w}}}\|^2$ instead of maximizing
$\frac{1}{||\w ||}$.
(Maximizing $1/\|\w\|$ is the same as minimizing $\|\w\|$, and
squaring is a monotone transformation because $\|{{\mathbf{w}}}\|$ is
positive, so this doesn’t change the optimal solution.)
- This gets us to:
----------- ----------------------------------------------------------------------
minimize    $\|{{\mathbf{w}}}\|^2$ w.r.t. ${{\mathbf{w}}}, w_0$
subject to  $y_i(\x_i^\T \w +w_0)\geq1$
----------- ----------------------------------------------------------------------

- This we can solve! How?
- It is a convex *quadratic programming* (QP)
problem—a standard type of optimization problem for which many
efficient packages are available.
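
In R, one option (an assumption: the `e1071` package, which wraps the libsvm solver, is installed) is to approximate the hard-margin problem with a very large cost:

```{r}
# Sketch: a hard-margin linear SVM, approximated with a very large cost.
library(e1071)
set.seed(1)
X <- cbind(x1 = runif(20), x2 = runif(20))
y <- factor(ifelse(X[, 1] + X[, 2] > 1, 1, -1))  # separable labels
fit <- svm(X, y, type = "C-classification", kernel = "linear",
           cost = 1e5, scale = FALSE)            # huge C ~ hard margin
w  <- colSums(fit$coefs[, 1] * fit$SV)           # w = sum_i alpha_i y_i x_i
w0 <- -fit$rho                                   # libsvm stores -w_0 in rho
```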
## Example
We have a solution, but no “support vectors” yet...
## What are "Support Vectors"?
----------- ----------------------------------------------------------------------
minimize    $\frac{1}{2} \|{{\mathbf{w}}}\|^2$ w.r.t. ${{\mathbf{w}}}, w_0$
subject to  $y_i(\x_i^\T \w +w_0)\geq1$
----------- ----------------------------------------------------------------------

- Turns out (HTF Ch. 4.5.2) we can write:
$${\bf w}=\sum_i \alpha_i y_i \x_i,~~\mbox{where $\alpha_i \ge 0$}$$
- As for the perceptron with zero initial weights, the optimal
solution for ${{\mathbf{w}}}$ and $w_0$ is a linear combination of
the ${{\x_i}}$.
- The output is therefore:
$$h_{\w,w_0}(\x) = \mbox{sign} \left(\sum_{i=1}^n \alpha_i y_i ({{\x_i}}\cdot {{\x}}) +w_0\right)$$
- Output depends on weighted dot product of input vector with training
examples
## Solving “the dual”
- We can actually solve directly for the $\alpha_i$ (again see HTF
Ch. 4.5.2):
$$\max_{{{\boldsymbol{\alpha}}}} \sum_{i=1}^n \alpha_i -\frac{1}{2} \sum_{i=1}^n
\sum_{j=1}^n y_i y_j \alpha_i \alpha_j (\x_i \cdot \x_j)$$
with constraints:
$\alpha_i \geq 0 \mbox{ and } \sum_i \alpha_i y_i = 0$
- This is also a QP
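
A sketch of handing this dual to a generic QP solver (assumptions: the `quadprog` package is installed; the small ridge added to the matrix is a numerical hack, since `solve.QP` requires strict positive definiteness):

```{r}
# Sketch: solving the hard-margin dual with quadprog::solve.QP.
library(quadprog)
set.seed(1)
X <- cbind(runif(20), runif(20))
y <- ifelse(X[, 1] + X[, 2] > 1, 1, -1)     # separable labels in {-1,+1}
n <- nrow(X)
Q <- (y * X) %*% t(y * X)                   # Q_ij = y_i y_j (x_i . x_j)
sol <- solve.QP(Dmat = Q + 1e-8 * diag(n),  # ridge: Q is only semi-definite
                dvec = rep(1, n),           # linear term: sum_i alpha_i
                Amat = cbind(y, diag(n)),   # sum_i alpha_i y_i = 0; alpha >= 0
                bvec = rep(0, n + 1), meq = 1)
alpha <- sol$solution
sv <- which(alpha > 1e-6)                   # support vectors: alpha_i > 0
w  <- colSums(alpha * y * X)                # w = sum_i alpha_i y_i x_i
w0 <- y[sv[1]] - sum(X[sv[1], ] * w)        # from y_k (x_k^T w + w0) = 1
```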
## The support vectors
- Suppose we find optimal ${{\boldsymbol{\alpha}}}$s (e.g.,
using a standard QP package)
- The $\alpha_i$ will be $>0$ only for the points for which
$y_i(\x_i^\T \w + w_0)=1$
- These are the points lying on the edge of the margin, and they are
called *support vectors*, because they define the
decision boundary
- The output of the classifier for query point
$\x$ is computed as:
$$\mbox{sgn}\left[\left(\sum_{i=1}^n \alpha_i y_i (\x_i \cdot \x)\right) + w_0 \right]$$
Hence, the output is determined by computing the *dot product
of the point with the support vectors*!
## Example
Support vectors are in bold
## But why all this work?
- SVMs are a state-of-the-art method for classification when you don’t need
probability estimates
- Intuitively, the large-margin property makes sense. Theory backs
this up.
- SVMs offer “off-the-shelf” *non*-linear classification
without having to do explicit feature construction, as we will see.
## Soft margin classifiers
- Recall that in the linearly separable case, we compute the solution
to the following optimization problem:
------ --------------------------------------------------------------------------------
min    $\frac{1}{2}\|{{\mathbf{w}}}\|^2$ w.r.t. ${{\mathbf{w}}}, w_0$
s.t.   $y_i(\x_i^\T \w + w_0)\geq1$
------ --------------------------------------------------------------------------------

- What if we can't satisfy the constraints?
## Soft margin classifiers
- To allow misclassifications, we relax the constraints to:
$$y_i(\x_i^\T \w + w_0) \geq 1-\xi_i$$
- If $\xi_i \in (0,1)$, the data point is within the margin, but still correctly classified
- If $\xi_i \geq 1$, then the data point is misclassified
- We define the *soft error* as $\sum_i \xi_i$; each
$\xi_i$ is a *slack variable*
## Problem formulation with soft errors
- Instead of:
------ --------------------------------------------------------------------------------
min    $\frac{1}{2}\|{{\mathbf{w}}}\|^2$ w.r.t. ${{\mathbf{w}}}, w_0$
s.t.   $y_i(\x_i^\T \w + w_0)\geq1$
------ --------------------------------------------------------------------------------

we want to solve:

------ --------------------------------------------------------------------------------
min    $\frac{1}{2}\|{{\mathbf{w}}}\|^2+ C \sum_i \xi_i$ w.r.t. ${{\mathbf{w}}}, w_0, \xi_i$
s.t.   $y_i(\x_i^\T \w + w_0)\geq1-\xi_i$, $\xi_i \geq 0$
------ --------------------------------------------------------------------------------

- Note that soft errors include points that are misclassified,\
as well as points within the margin
- There is a linear penalty for both categories
- The choice of the *constant $C$ controls boundary-fitting*
## A built-in boundary-fitting knob
------ --------------------------------------------------------------------------------
min    $\frac{1}{2}\|{{\mathbf{w}}}\|^2+ C \sum_i \xi_i$
w.r.t. ${{\mathbf{w}}}, w_0, \xi_i$
s.t.   $y_i(\x_i^\T \w +w_0) \geq 1-\xi_i$, $\xi_i \geq 0$
------ --------------------------------------------------------------------------------

- If $C$ is very small, there is almost no penalty for soft errors, so the focus is
on maximizing the margin, even if this means more mistakes
- If $C$ is very large, the emphasis on soft errors will decrease the
margin if this helps to classify more examples correctly.
- Internal cross-validation is a good way to choose $C$ appropriately
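
A sketch of doing exactly that (assuming the `e1071` package; its `tune` function does 10-fold cross-validation by default):

```{r}
# Sketch: pick C by internal cross-validation over a coarse grid.
library(e1071)
d <- transform(bc, Outcome = factor(Outcome))   # svm wants a factor response
cv <- tune(svm, Outcome ~ Radius.Mean + Compactness.Mean, data = d,
           kernel = "linear", ranges = list(cost = 10^(-2:3)))
cv$best.parameters                              # the C with lowest CV error
```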
## Dual form for the soft margin problem
- Like before, we can formulate a “dual” problem that identifies the
support vectors:
Primal form:

------ --------------------------------------------------------------------------------------------------------------------------
min    $\|{{\mathbf{w}}}\|^2+{\color{red}{C\sum_i\xi_i}}$ w.r.t. ${{\mathbf{w}}}, w_0, {\color{red}{\xi_i}}$
s.t.   ${{y}}_i(\x_i^\T \w +w_0)\geq {\color{red}{(1-\xi_i)}}$, $\xi_i\geq 0$
------ --------------------------------------------------------------------------------------------------------------------------

Dual form:

------ --------------------------------------------------------------------------------------------------------------------------
max    $\sum_{i=1}^n\alpha_i-\frac{1}{2}\sum_{i,j=1}^n{{y}}_i{{y}}_j\alpha_i\alpha_j({{\x_i}}\cdot{{\x_j}})$ w.r.t. $\alpha_i$
s.t.   $0\leq\alpha_i {\color{red}{\leq C}}$, $\sum_{i=1}^n\alpha_i{{y}}_i=0$
------ --------------------------------------------------------------------------------------------------------------------------

- All the previously described machinery can be used to solve this
problem