Linear Models

2016-02-09

Linear models in general HTF Ch. 2.8.3

By linear models, we mean that the hypothesis function $h_{\bf w}({\bf x})$ is a linear function of the parameters ${\bf w}$.
Predictions are a linear combination of feature values
\[h_{\bf w}({\mathbf{x}}) = \sum_{k=0}^{p} w_k \phi_k({\mathbf{x}}) = {{\boldsymbol{\phi}}}({\mathbf{x}})^{\mathsf{T}}{{\mathbf{w}}}\] where the $\phi_k$ are called basis functions (or features!). As usual, we let $\phi_0({\mathbf{x}})=1, \forall {\mathbf{x}}$, to create a bias.
To recover degree-$d$ polynomial regression in one variable, set \[\phi_0(x) = 1, \phi_1(x) = x, \phi_2(x) = x^2, \ldots, \phi_d(x) = x^d\]
Basis functions are fixed for a given analysis
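A minimal sketch of the polynomial basis above, assuming a made-up noisy cubic data set (the data, the degree `d <- 3` and the variable names are illustrative, not from the lecture). It builds the design matrix column by column from the $\phi_k$ and solves the least-squares problem directly.

```r
# Hypothetical toy data: a noisy cubic in one variable.
set.seed(1)
x <- seq(-1, 1, length.out = 50)
y <- 1 - 2 * x + 0.5 * x^3 + rnorm(length(x), sd = 0.1)

# Design matrix Phi with columns phi_0(x) = 1, phi_1(x) = x, ..., phi_d(x) = x^d.
d <- 3
Phi <- outer(x, 0:d, `^`)

# Least-squares weights; the model is linear in w even though it is cubic in x.
w <- solve(t(Phi) %*% Phi, t(Phi) %*% y)

# Predictions are phi(x)^T w.
h <- as.vector(Phi %*% w)
```

The same weights should come out of `lm(y ~ x + I(x^2) + I(x^3))`, since only the features changed, not the fitting method.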

Linear Methods for Classification

Classification tasks
Error functions for classification
Logistic Regression
Generalized Linear Models
Support Vector Machines

Example: Given nucleus radius, predict cancer recurrence

ggplot(bc,aes(Radius.Mean,fill=Outcome,color=Outcome)) + geom_density(alpha=I(1/2))

Example: Solution by linear regression

Univariate real input: nucleus size
Output coding: non-recurrence = 0, recurrence = 1
Sum squared error minimized by the blue line

Linear regression for classification

The predictor shows an increasing trend towards recurrence with larger nucleus size, as expected.
Output cannot be directly interpreted as a class prediction.
Thresholding output (e.g., at 0.5) could be used to predict 0 or 1.
(In this case, prediction would be 0 except for extremely large nucleus size.)
Interpret as probability? Not bounded to $[0,1]$, not consistent even for well-separated data
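To illustrate the last point, here is a small sketch on hypothetical, perfectly separated toy data (not the breast cancer data). Even in this easy case the least-squares fit produces values below 0 and above 1.

```r
# Hypothetical, perfectly separated toy data: class 0 for small x, class 1 for large x.
x <- 1:10
y <- c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1)

fit <- lm(y ~ x)
pred <- predict(fit, newdata = data.frame(x = c(1, 5.5, 10)))

# Fitted values at the extremes fall outside [0, 1], so they cannot be probabilities.
print(pred)
range_ok <- all(pred >= 0 & pred <= 1)
```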

Probabilistic view

Suppose we have two possible classes: $y\in \{0,1\}$.
The symbols “$0$” and “$1$” are unimportant. Could have been $\{a,b\}$, $\{\mathit{up},\mathit{down}\}$, whatever. We’ll use $y\in \{0,1\}$ though.
Rather than try to predict the class label directly, ask:
What is the probability that a given input ${\mathbf{x}}$ has class $y=1$?
Bayes Rule:

\[P(y=1|{\mathbf{x}}) = \frac{P({\mathbf{x}}, y=1)}{P({\mathbf{x}})} = \frac{P({\mathbf{x}}| y=1)P(y=1)}{P({\mathbf{x}}|y=1)P(y=1)+P({\mathbf{x}}|y=0)P(y=0)} \]

Probabilistic models for binary classification

Can also write: \[P(y=1|{\mathbf{x}})=\sigma\left(\log\frac{P(y=1|{\mathbf{x}})}{P(y=0|{\bf x})}\right) = \sigma\left(\log\frac{P({\mathbf{x}}|y=1)P(y=1)}{P({\mathbf{x}}|y=0)P(y=0)}\right)\] where $\sigma(a) = \frac{1}{1+\exp(-a)}$, the sigmoid or logistic function.
Discriminative Learning:
- Model $\log\frac{P(y=1|{\mathbf{x}})}{P(y=0|{\mathbf{x}})}$ (log odds ratio) as a function of $\mathbf{x}$
- Only models how to discriminate between examples of the two classes. Does not model distribution of $\mathbf{x}$.
Generative Learning:
- Model $P(y=1), P(y=0), P({\mathbf{x}}|y=1), P({\mathbf{x}}|y=0)$, then use rightmost formula above
- Models the full joint; can actually use the model to generate (i.e. fantasize) data
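As a sanity check on the two ways of writing $P(y=1|{\mathbf{x}})$, the sketch below assumes a hypothetical generative model with Gaussian class-conditional densities (the means, variances and priors are made up) and compares Bayes rule with the sigmoid of the log-odds.

```r
# Hypothetical generative model: Gaussian class-conditional densities in one dimension.
prior1 <- 0.3
prior0 <- 1 - prior1
lik1 <- function(x) dnorm(x, mean = 2, sd = 1)   # P(x | y = 1)
lik0 <- function(x) dnorm(x, mean = 0, sd = 1)   # P(x | y = 0)

sigmoid <- function(a) 1 / (1 + exp(-a))

x <- c(-1, 0, 1, 2, 3)

# Bayes rule, written out directly.
post_bayes <- lik1(x) * prior1 / (lik1(x) * prior1 + lik0(x) * prior0)

# The same quantity as the sigmoid of the log-odds.
log_odds <- log(lik1(x) * prior1) - log(lik0(x) * prior0)
post_sigmoid <- sigmoid(log_odds)
```

With equal variances, as assumed here, the log-odds works out to be linear in $x$, which is exactly the form a discriminative linear model would fit directly.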

Logistic regression HTF (Ch. 4.4)

Represent the hypothesis as a logistic function of a linear combination of inputs: \[h({\mathbf{x}}) = \sigma({\mathbf{x}}^{\mathsf{T}}{\mathbf{w}})\]
Interpret $h({\mathbf{x}})$ as $P(y=1|{\mathbf{x}})$, interpret ${\mathbf{x}}^{\mathsf{T}}{\mathbf{w}}$ as the log-odds ratio.
How do we choose ${\bf w}$?
In the probabilistic framework, observing $\langle {\mathbf{x}}_i , 1 \rangle$ ( $\langle {\mathbf{x}}_i , 0 \rangle$ ) does not mean $h({\mathbf{x}}_i)$ should be $1$ ($0$)
Maximize probability of having observed the $y_i$, given the ${\mathbf{x}}_i$.

Max Conditional Likelihood

Maximize probability of having observed the $y_i$, given the ${\mathbf{x}}_i$.
Assumption 1: Examples are i.i.d. Probability of observing all $y$s is product \[\begin{gathered} P(Y_1=y_1, Y_2=y_2, ..., Y_n = y_n|X_1 = {\mathbf{x}}_1, X_2 = {\mathbf{x}}_2, ..., X_n = {\mathbf{x}}_n) \\ = \prod_{i=1}^n P(Y_i = y_i | X_i = {\mathbf{x}}_i)\end{gathered}\]
Assumption 2: \[\begin{aligned} P(y = 1|{\mathbf{x}}) & = h_{\mathbf{w}}({\mathbf{x}}) = \sigma({\mathbf{x}}^{\mathsf{T}}{\mathbf{w}}) = 1 / (1 + \exp(-{\mathbf{x}}^{\mathsf{T}}{\mathbf{w}}))\\ P(y = 0|{\mathbf{x}}) & = (1 - \sigma({\mathbf{x}}^{\mathsf{T}}{\mathbf{w}})) = \exp(-{\mathbf{x}}^{\mathsf{T}}{\mathbf{w}}) / (1 + \exp(-{\mathbf{x}}^{\mathsf{T}}{\mathbf{w}}))\\\end{aligned}\]
Probability will underflow; use log probability instead. Therefore \[\begin{aligned} \hspace{-2em} \log \prod_{i=1}^n P(Y_i = y_i | X_i = {\mathbf{x}}_i) & = \sum_{i = 1}^n \left[y_i \log( h_{\mathbf{w}}({\mathbf{x}}_i)) + (1 - y_i) \log (1 - h_{\mathbf{w}}({\mathbf{x}}_i))\right]\end{aligned}\]

Min Cross-Entropy

Maximize probability of having observed the $y_i$, given the ${\mathbf{x}}_i$.
More stable to maximize log probability. Note

\[\begin{aligned} \log P(Y_i = y_i | X_i = {\mathbf{x}}_i) & = \left\{ \begin{array}{ll} \log h_{\mathbf{w}}({\mathbf{x}}_i) & \mbox{if}~y_i=1 \\ \log(1-h_{\mathbf{w}}({\mathbf{x}}_i)) & \mbox{if}~y_i=0 \end{array} \right. \end{aligned} \]

Therefore,

\[\log \prod_{i=1}^n P(Y_i = y_i | X_i = {\mathbf{x}}_i) = \sum_{i = 1}^n \left[y_i \log( h_{\mathbf{w}}({\mathbf{x}}_i)) + (1 - y_i) \log (1 - h_{\mathbf{w}}({\mathbf{x}}_i))\right] \]

Suggests an error \[\begin{aligned} \hspace{-2em} J(h_{{\mathbf{w}}}) = - \sum_{i = 1}^n \left[y_i \log( h_{\mathbf{w}}({\mathbf{x}}_i)) + (1 - y_i) \log (1 - h_{\mathbf{w}}({\mathbf{x}}_i))\right]\end{aligned}\]
This is the cross-entropy: the number of bits (nats, when using the natural log) needed to transmit the $y_i$ if both parties know $h_{\mathbf{w}}$ and the ${\mathbf{x}}_i$.
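The objective $J$ can be minimized numerically. The sketch below assumes simulated data with hypothetical true weights, writes $J$ as an R function, and hands it to the general-purpose optimizer `optim()`; it is an illustration of the objective, not how `glm()` is implemented internally.

```r
set.seed(2)
n <- 500
x <- rnorm(n)
X <- cbind(1, x)                      # x_0 = 1 supplies the bias
w_true <- c(-0.5, 1.5)                # hypothetical "true" weights
y <- rbinom(n, size = 1, prob = as.vector(plogis(X %*% w_true)))

# J(w) = -sum_i [ y_i log h_w(x_i) + (1 - y_i) log(1 - h_w(x_i)) ]
cross_entropy <- function(w) {
  eta <- as.vector(X %*% w)
  # log h = plogis(eta, log.p = TRUE); log(1 - h) = plogis(-eta, log.p = TRUE)
  -sum(y * plogis(eta, log.p = TRUE) + (1 - y) * plogis(-eta, log.p = TRUE))
}

w_hat <- optim(c(0, 0), cross_entropy, method = "BFGS")$par

# glm() with the binomial family maximizes the same conditional likelihood.
w_glm <- coef(glm(y ~ x, family = binomial))
```

Computing the log-probabilities with `log.p = TRUE` avoids taking the log of a probability that has already underflowed to 0 or rounded to 1.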

Back to the breast cancer problem

Logistic Regression:

## (Intercept) Radius.Mean 
##  -3.4671348   0.1296493

Least Squares:

## (Intercept) Radius.Mean 
## -0.17166939  0.02349159
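To see what the two sets of coefficients imply, the sketch below plugs them into their respective hypothesis functions. The query radii are hypothetical values chosen for illustration, not observations from the data.

```r
# Coefficients copied from the output above.
w_lr  <- c(intercept = -3.4671348, radius = 0.1296493)    # logistic regression
w_ols <- c(intercept = -0.17166939, radius = 0.02349159)  # least squares

radius <- c(10, 20, 30)   # hypothetical query values, not taken from the data

# Logistic regression: sigmoid of the linear predictor, always in (0, 1).
p_lr <- plogis(w_lr[["intercept"]] + w_lr[["radius"]] * radius)

# Least squares: the raw linear predictor.
h_ols <- w_ols[["intercept"]] + w_ols[["radius"]] * radius

# Radius at which each model crosses 0.5.
cut_lr  <- -w_lr[["intercept"]] / w_lr[["radius"]]
cut_ols <- (0.5 - w_ols[["intercept"]]) / w_ols[["radius"]]
```

Both cutoffs come out large, which matches the earlier remark that a 0.5 threshold predicts non-recurrence except for very large nuclei.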

Supervised Learning Methods: “Objective-driven”

Mthd.	Form	Objective
OLS	$h_{\mathbf{w}}({\mathbf{x}}) = {\mathbf{x}}^{\mathsf{T}}{\mathbf{w}}$	$\sum_{i=1}^n (h_{\mathbf{w}}({\mathbf{x}}_i) - y_i)^2$
	$\approx E[Y|\mathbf{X}={\mathbf{x}}]$…	…using a linear function
LR	$h_{\mathbf{w}}({\mathbf{x}}) = \frac{1}{1 + \mathrm{e}^{-{\mathbf{x}}^{\mathsf{T}}{\mathbf{w}}}}$	$-\sum_{i=1}^n \left[y_i \log h_{\mathbf{w}}({\mathbf{x}}_i) + (1-y_i) \log (1-h_{\mathbf{w}}({\mathbf{x}}_i))\right]$
	$\approx P(Y=1|\mathbf{X}={\mathbf{x}})$…	…using sigmoid of a linear function

Both model the conditional mean of $y$ using a (transformed) linear function
Both estimate ${\mathbf{w}}$ by maximum conditional likelihood (for OLS, under a Gaussian noise model)

Generalized Linear Models

Model the conditional mean of $Y|{\mathbf{X}}={\mathbf{x}}$, denoted $\hat\mu_{\mathbf{x}}$
Assumption: $g(\hat\mu_{\mathbf{x}}) = {\mathbf{x}}^{\mathsf{T}}{\mathbf{w}}$
$g$ is the link function
- Linear regression: ${\mathbf{x}}^{\mathsf{T}}{\mathbf{w}}= \hat\mu_{\mathbf{x}}$, $\hat\mu_{\mathbf{x}}= {\mathbf{x}}^{\mathsf{T}}{\mathbf{w}}$
  - Identity link: $g(y) = y$
- Logistic regression: ${\mathbf{x}}^{\mathsf{T}}{\mathbf{w}}= \ln \frac{\hat\mu_{\mathbf{x}}}{1 - \hat\mu_{\mathbf{x}}}$, $\hat\mu_{\mathbf{x}}= \frac{1}{1 + \mathrm{e}^{-{\mathbf{x}}^{\mathsf{T}}{\mathbf{w}}}}$
  - Logit link: $g(y) = \ln \frac{y}{1 - y}$
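The link functions listed above are available from R's family objects. The sketch below only reads them out; the values of `eta` are hypothetical linear predictors ${\mathbf{x}}^{\mathsf{T}}{\mathbf{w}}$.

```r
# Family objects in base R carry the link g and its inverse.
logit         <- binomial()$linkfun   # g(mu) = log(mu / (1 - mu))
inv_logit     <- binomial()$linkinv   # g^{-1}(eta) = 1 / (1 + exp(-eta))
identity_link <- gaussian()$linkfun   # g(mu) = mu

eta <- c(-2, 0, 2)                    # hypothetical values of x^T w
mu  <- inv_logit(eta)                 # conditional mean under logistic regression
```

Applying `logit` to `mu` recovers `eta`, which is the GLM assumption $g(\hat\mu_{\mathbf{x}}) = {\mathbf{x}}^{\mathsf{T}}{\mathbf{w}}$ read in both directions.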

Poisson Distribution

Poisson Regression

Assume $Y|X$ is Poisson
$\hat\lambda_{\mathbf{x}}= \hat\mu_{\mathbf{x}}= \mathrm{e}^{{\mathbf{w}}^{\mathsf{T}}{\mathbf{x}}}$
${\mathbf{w}}^{\mathsf{T}}{\mathbf{x}}= \ln \hat\lambda_{\mathbf{x}}= \ln\hat\mu_{\mathbf{x}}$
Link function is $g(y) = \ln y$
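A minimal simulation sketch of Poisson regression, assuming made-up true weights and a single uniform feature. Counts are drawn with mean $\mathrm{e}^{{\mathbf{w}}^{\mathsf{T}}{\mathbf{x}}}$ and the weights are then re-estimated with `glm()`.

```r
set.seed(3)
n <- 2000
x <- runif(n, 0, 2)
w_true <- c(0.5, 0.8)                       # hypothetical intercept and slope
lambda <- exp(w_true[1] + w_true[2] * x)    # conditional mean, e^{w^T x}
y <- rpois(n, lambda)

fit <- glm(y ~ x, family = poisson)         # log link is the default for poisson
w_hat <- coef(fit)

# Predicted mean at x = 1 is exp of the linear predictor.
mu_hat <- unname(exp(w_hat[1] + w_hat[2] * 1))
```

Because the link is the log, a one-unit increase in $x$ multiplies the predicted mean by $\mathrm{e}^{w_1}$ rather than adding to it.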

Horseshoe Crabs

##    Satellites         Width       Dark     GoodSpine
##  Min.   : 0.000   Min.   :21.0   no :107   no :121  
##  1st Qu.: 0.000   1st Qu.:24.9   yes: 66   yes: 52  
##  Median : 2.000   Median :26.1                      
##  Mean   : 2.919   Mean   :26.3                      
##  3rd Qu.: 5.000   3rd Qu.:27.7                      
##  Max.   :15.000   Max.   :33.5

Horseshoe Crabs

Poisson Regression

preg <- glm(data=crabs,formula=Satellites ~ Width * Dark * GoodSpine,family="poisson"); summary(preg)

## 
## Call:
## glm(formula = Satellites ~ Width * Dark * GoodSpine, family = "poisson", 
##     data = crabs)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9448  -1.9738  -0.4940   0.9552   4.6511  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                -3.41436    1.00512  -3.397 0.000681 ***
## Width                       0.17127    0.03656   4.685 2.81e-06 ***
## Darkyes                    -1.04896    1.65607  -0.633 0.526472    
## GoodSpineyes                2.26862    1.32812   1.708 0.087610 .  
## Width:Darkyes               0.02991    0.06200   0.482 0.629544    
## Width:GoodSpineyes         -0.08400    0.04850  -1.732 0.083293 .  
## Darkyes:GoodSpineyes       -7.40779    3.48306  -2.127 0.033436 *  
## Width:Darkyes:GoodSpineyes  0.27509    0.12655   2.174 0.029723 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 632.79  on 172  degrees of freedom
## Residual deviance: 549.49  on 165  degrees of freedom
## AIC: 920.79
## 
## Number of Fisher Scoring iterations: 6

Poisson Regression

Negative Binomial Regression

library(MASS)
nbreg <- glm.nb(data=crabs,formula=Satellites ~ Width * Dark * GoodSpine)
summary(nbreg)

## 
## Call:
## glm.nb(formula = Satellites ~ Width * Dark * GoodSpine, data = crabs, 
##     init.theta = 0.9563926918, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8195  -1.3910  -0.2512   0.4458   2.3138  
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)  
## (Intercept)                 -3.922836   2.212990  -1.773   0.0763 .
## Width                        0.189965   0.081784   2.323   0.0202 *
## Darkyes                     -0.356561   3.229882  -0.110   0.9121  
## GoodSpineyes                 2.313754   2.916082   0.793   0.4275  
## Width:Darkyes                0.004143   0.122437   0.034   0.9730  
## Width:GoodSpineyes          -0.085398   0.108215  -0.789   0.4300  
## Darkyes:GoodSpineyes       -12.998600   7.323821  -1.775   0.0759 .
## Width:Darkyes:GoodSpineyes   0.482453   0.272911   1.768   0.0771 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(0.9564) family taken to be 1)
## 
##     Null deviance: 219.61  on 172  degrees of freedom
## Residual deviance: 195.88  on 165  degrees of freedom
## AIC: 763.6
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  0.956 
##           Std. Err.:  0.174 
## 
##  2 x log-likelihood:  -745.598

Negative Binomial Regression

Dummy Variables

For all regressions, $x_i$ have to be numbers.
For a categorical variable (factor) with $k$ levels, create $k-1$ binary features

head(crabs$Dark)

## [1] no  yes no  yes yes no 
## Levels: no yes

summary(lm(formula=Width ~ Dark,data=crabs))

## 
## Call:
## lm(formula = Width ~ Dark, data = crabs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5939 -1.5336 -0.0336  1.4061  6.7664 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  26.7336     0.1973 135.516  < 2e-16 ***
## Darkyes      -1.1397     0.3194  -3.568 0.000466 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.041 on 171 degrees of freedom
## Multiple R-squared:  0.0693, Adjusted R-squared:  0.06386 
## F-statistic: 12.73 on 1 and 171 DF,  p-value: 0.0004662

Other Encodings

Suppose we had a categorical variable Colour with three levels, "Red", "Blue" or "Green"
What's the difference between coding with 2 dummies versus coding as {0,1,2}?
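One way to explore the question is to build both encodings and look at the resulting design matrices. The factor below is a hypothetical five-observation example.

```r
# Hypothetical factor with three levels.
Colour <- factor(c("Red", "Blue", "Green", "Blue", "Red"),
                 levels = c("Red", "Blue", "Green"))

# Treatment (dummy) coding: intercept plus k - 1 = 2 indicator columns.
X_dummy <- model.matrix(~ Colour)

# Integer coding {0, 1, 2}: a single numeric column.
X_int <- cbind(`(Intercept)` = 1, Colour = as.integer(Colour) - 1)
```

With integer coding there is a single coefficient, so the fitted effect of Green relative to Red is forced to be exactly twice that of Blue, and the levels are given an order. Dummy coding gives each non-reference level its own coefficient.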

Inference

head(crabs$Dark)

## [1] no  yes no  yes yes no 
## Levels: no yes

summary(lm(formula=Width ~ Dark,data=crabs))

## 
## Call:
## lm(formula = Width ~ Dark, data = crabs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5939 -1.5336 -0.0336  1.4061  6.7664 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  26.7336     0.1973 135.516  < 2e-16 ***
## Darkyes      -1.1397     0.3194  -3.568 0.000466 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.041 on 171 degrees of freedom
## Multiple R-squared:  0.0693, Adjusted R-squared:  0.06386 
## F-statistic: 12.73 on 1 and 171 DF,  p-value: 0.0004662

P-values

Null hypothesis: True value of a coefficient is zero.
Observe data, compute estimate of coefficient $w_i$. Probably not exactly zero.
If the null were true, what is the chance we would observe an estimate $w$ with $|w| \ge |w_i|$?
- Comes from sampling distribution, via CLT or bootstrap.
If small enough, evidence for rejecting the null.
Says nothing about the importance of a coefficient to the model.
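As a worked instance, the p-value reported for `Darkyes` in the `lm()` output above can be recomputed from its t statistic and the residual degrees of freedom. This sketch assumes the usual two-sided test against a Student t sampling distribution.

```r
# Values copied from the lm() output above for the Darkyes coefficient.
t_value <- -3.568
df <- 171

# Two-sided p-value: probability, under the null, of a t statistic at least this extreme.
p_value <- 2 * pt(-abs(t_value), df = df)
```

The result should agree with the `Pr(>|t|)` column up to the rounding of the t statistic.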

Regression Diagnostics

Decision boundary HTF Ch. 2.3.1,2.3.2

How complicated is a classifier?
One way to think about it is in terms of its decision boundary, i.e. the line it defines for separating examples
Linear classifiers draw a hyperplane between examples of the different classes. Non-linear classifiers draw more complicated surfaces between the different classes.
For a probabilistic classifier with a cutoff of 0.5,
the decision boundary is the curve on which: \[\frac{P(y=1|{\mathbf{x}})}{P(y=0|{\mathbf{x}})} = 1, \mbox{i.e., where } \log\frac{P(y=1|{\mathbf{x}})}{P(y=0|{\mathbf{x}})} = 0\]
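For logistic regression the log-odds is ${\mathbf{x}}^{\mathsf{T}}{\mathbf{w}}$, so the condition above can be worked out explicitly. The general cutoff $c$ below is introduced for illustration; $c = 0.5$ is the case stated above.

```latex
\[
P(y=1|{\mathbf{x}}) = \sigma({\mathbf{x}}^{\mathsf{T}}{\mathbf{w}}) > c
\quad\Longleftrightarrow\quad
{\mathbf{x}}^{\mathsf{T}}{\mathbf{w}} > \log\frac{c}{1-c}
\]
\[
c = 0.5:\ {\mathbf{x}}^{\mathsf{T}}{\mathbf{w}} > 0,
\qquad
c = 0.25:\ {\mathbf{x}}^{\mathsf{T}}{\mathbf{w}} > \log\tfrac{1}{3} \approx -1.10
\]
```

In both cases the boundary is a hyperplane; changing the cutoff shifts it without changing its orientation.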

Decision boundary

Class = R if ${\mathrm{Pr}}(Y=1|X=x) > 0.5$

Decision boundary

Class = R if ${\mathrm{Pr}}(Y=1|X=x) > 0.25$

Decision boundary

Class = R if ${\mathrm{Pr}}(Y=1|X=x) > 0.5$

Decision boundary

Class = R if ${\mathrm{Pr}}(Y=1|X=x) > 0.25$

Supervised Learning Methods: “Objective-driven”

Mthd.	Form	Objective
OLS	$h_{\mathbf{w}}({\mathbf{x}}) = {\mathbf{x}}^{\mathsf{T}}{\mathbf{w}}$	$\sum_{i=1}^n(h_{\mathbf{w}}({\mathbf{x}}_i) - y_i)^2$
	$\approx E[Y|\mathbf{X}={\mathbf{x}}]$…	…using a linear function
LR	$h_{\mathbf{w}}({\mathbf{x}}) = \frac{1}{1 + \mathrm{e}^{-{\mathbf{x}}^{\mathsf{T}}{\mathbf{w}}}}$	$-\sum_{i=1}^n \left[y_i \log h_{\mathbf{w}}({\mathbf{x}}_i) + (1-y_i) \log (1-h_{\mathbf{w}}({\mathbf{x}}_i))\right]$
	$\approx P(Y=1|\mathbf{X}={\mathbf{x}})$…	…using sigmoid of a linear function
SVM	$h_{\mathbf{w}}({\mathbf{x}}) = \mathrm{sgn}({\mathbf{x}}^{\mathsf{T}}{\mathbf{w}})$
		…using a linear function

Large Margin Classifiers:
Linear Support Vector Machines

Linear classifiers that focus on learning the decision boundary rather than the conditional distribution $P(Y=y|\mathbf{X}={\mathbf{x}})$
- Perceptrons
  - Definition
  - Perceptron learning rule
  - Convergence
- “Margin” idea and max margin classifiers
- (Linear) support vector machines
  - Formulation as optimization problem

Marvin Minsky, 1927-2016

Perceptrons HTF Ch. 4.5

Consider a binary classification problem with data $\{{{{\mathbf{x}}_i}},y_i\}_{i=1}^n$, $y_i\in\{-1,+1\}$. Note coding of $y_i$.
A perceptron (Rosenblatt, 1957) is a classifier of the form: \[h_{{{\mathbf{w}}},w_0}({{{\mathbf{x}}}}) = \mbox{sign}({\mathbf{x}}^{\mathsf{T}}{\mathbf{w}}+ w_0) = \left\{ \begin{array}{ll} +1 & \mathrm{if}~ {\mathbf{x}}^{\mathsf{T}}{\mathbf{w}}+ w_0\geq 0 \\ -1 & \mathrm{otherwise} \end{array} \right.\] Here, ${{\mathbf{w}}}$ is a vector of weights, and $w_0$ is a constant offset. (Note $x_0 = 1$ is omitted.)
The decision boundary is ${\mathbf{x}}^{\mathsf{T}}{\mathbf{w}}+ w_0= 0$.
Perceptrons output a class, not a probability
An example $( {{{\mathbf{x}}}}, y )$ is classified correctly if: \[y \cdot ({\mathbf{x}}^{\mathsf{T}}{\mathbf{w}}+ w_0) > 0\]

Linear separability

The data set is linearly separable if and only if there exists ${{\mathbf{w}}}$, $w_0$ such that:
- For all $i$, $y_i({\mathbf{x}}_i^{\mathsf{T}}{\mathbf{w}}+w_0)>0$.
- Or equivalently, the 0-1 loss $\sum_i \mathbf{1}_{y_i({\mathbf{x}}_i^{\mathsf{T}}{\mathbf{w}}+w_0) \le 0}$ is zero for some set of parameters $({\bf w}, w_0)$.

Linear Separability

The Perceptron Learning Rule

Consider the following procedure:
1. Initialize ${{\mathbf{w}}}$ and $w_0$ randomly
2. While any training examples remain incorrectly classified
  1. Loop through all misclassified examples
  2. For misclassified example $i$, perform the updates: \[{{\mathbf{w}}}\gets {{\mathbf{w}}}+ \delta y_i{{{\mathbf{x}}}}_i,~~~~~w_0\gets w_0 + \delta y_i\] where $\delta$ is a step-size parameter.
The update equation, or sometimes the whole procedure, is called the perceptron learning rule.
Intuition: For positive examples misclassified as negative, change ${{\mathbf{w}}}$ to increase ${\mathbf{x}}_i^{\mathsf{T}}{\mathbf{w}}+w_0$, and vice versa
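The procedure above can be sketched in a few lines. The data set, the epoch cap `max_epochs` and the function name `perceptron` are assumptions made for this illustration; the update itself is the rule stated above.

```r
# Hypothetical linearly separable toy data in two dimensions, labels in {-1, +1}.
set.seed(4)
X <- matrix(runif(400, -1, 1), ncol = 2)
s <- X[, 1] + X[, 2]
keep <- abs(s) > 0.2                 # leave a gap so the classes are separable
X <- X[keep, , drop = FALSE]
y <- ifelse(s[keep] > 0, 1, -1)

perceptron <- function(X, y, delta = 1, max_epochs = 1000) {
  w <- rnorm(ncol(X))                          # step 1: random initialization
  w0 <- rnorm(1)
  for (epoch in seq_len(max_epochs)) {
    wrong <- which(y * (X %*% w + w0) <= 0)    # currently misclassified examples
    if (length(wrong) == 0) break              # step 2: stop when none remain
    for (i in wrong) {
      # Update only if example i is still misclassified under the current weights.
      if (y[i] * (sum(X[i, ] * w) + w0) <= 0) {
        w  <- w + delta * y[i] * X[i, ]
        w0 <- w0 + delta * y[i]
      }
    }
  }
  list(w = w, w0 = w0, epochs = epoch)
}

fit <- perceptron(X, y)
train_correct <- all(y * (X %*% fit$w + fit$w0) > 0)
```

The epoch cap is only a safeguard for data that turn out not to be separable; on separable data the loop exits on its own.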

Error Minimization Interpretation

PLR can be interpreted as a gradient descent on the following function: \[{{J}}({{\mathbf{w}}},w_0) = \sum_{i=1}^n \left\{ \begin{array}{ll} 0 & \mathrm{if}~ y_i({\mathbf{x}}_i^{\mathsf{T}}{\mathbf{w}}+ w_0)\geq 0 \\ -y_i({\mathbf{x}}_i^{\mathsf{T}}{\mathbf{w}}+ w_0) & \mathrm{if}~ y_i({\mathbf{x}}_i^{\mathsf{T}}{\mathbf{w}}+ w_0)<0 \end{array}\right.\]
For correctly classified examples, the error is zero.
For incorrectly classified examples, the error is by how much ${\mathbf{x}}_i^{\mathsf{T}}{\mathbf{w}}+w_0$ is on the wrong side of the decision boundary.
$J$ is piecewise linear, so it has a gradient almost everywhere; stochastic gradient descent gives the perceptron learning rule.
$J$ is zero if and only if all examples are classified correctly – just like the 0-1 loss function.

Perceptron convergence theorem

If the data are linearly separable, then the perceptron learning rule will find a separator after some finite number of updates.
The number of updates depends on the data set, and also on the step size parameter.
If the data are not linearly separable, there will be oscillation (which can be detected automatically).

Perceptron Learning Example

Weight as a combination of input vectors

Recall the perceptron learning rule: \[{{\mathbf{w}}}\gets {{\mathbf{w}}}+ \delta y_i{{{\mathbf{x}}}}_i,~~~~~w_0\gets w_0 + \delta y_i\]
If initial weights are zero, then at any step, the weights are a linear combination of feature vectors of the examples: \[{{\mathbf{w}}}= \sum_{i=1}^n \alpha_i y_i {{{\mathbf{x}}_i}},~~~~~w_0 =\sum_{i=1}^n \alpha_i y_i\] where $\alpha_i$ is the sum of step sizes used for all updates based on example $i$.
This is called the dual representation of the classifier.
Even by the end of training, some examples may never have participated in an update (they were always classified correctly), so the corresponding $\alpha_i=0$.
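The dual representation can be checked numerically. The sketch below assumes hypothetical separable toy data and a fixed step size, starts from zero weights, accumulates $\alpha_i$ alongside the usual updates, and then rebuilds ${\mathbf{w}}$ and $w_0$ from the $\alpha_i$ alone.

```r
# Hypothetical separable toy data, labels in {-1, +1}.
set.seed(5)
X <- matrix(runif(200, -1, 1), ncol = 2)
s <- X[, 1] - X[, 2]
keep <- abs(s) > 0.2
X <- X[keep, , drop = FALSE]
y <- ifelse(s[keep] > 0, 1, -1)
n <- nrow(X)

delta <- 0.5
w <- c(0, 0)                   # zero initial weights
w0 <- 0
alpha <- numeric(n)            # alpha_i: total step size spent on example i

for (epoch in 1:1000) {
  updated <- FALSE
  for (i in 1:n) {
    if (y[i] * (sum(X[i, ] * w) + w0) <= 0) {
      w  <- w + delta * y[i] * X[i, ]
      w0 <- w0 + delta * y[i]
      alpha[i] <- alpha[i] + delta
      updated <- TRUE
    }
  }
  if (!updated) break
}

# Rebuild the weights from the dual coefficients.
w_dual  <- as.vector(t(X) %*% (alpha * y))
w0_dual <- sum(alpha * y)
n_unused <- sum(alpha == 0)    # examples that never triggered an update
```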

Examples used (bold) and not used (faint) in updates

Comment: Solutions are nonunique

Perceptron summary

Perceptrons can be learned to fit linearly separable data, using a gradient descent rule.
Blindingly fast
Solutions are non-unique

Support Vector Machines

Support vector machines (SVMs) for binary classification can be viewed as a way of training perceptrons
Three main new ideas:
- An optimization criterion (the "margin") guarantees uniqueness and has theoretical advantages
- Natural handling of nonseparable data by allowing mistakes
- An efficient way of operating in expanded feature spaces: "kernel trick"
SVMs can also be used for multiclass classification and regression.

Returning to the non-uniqueness issue

Consider a linearly separable binary classification data set
There is an infinite number of hyperplanes that separate the classes:

Which plane is best?
For a given plane, for which points should we be most confident in the classification?

The margin, and linear SVMs

For a given separating hyperplane, the margin is two times the (Euclidean) distance from the hyperplane to the nearest training example.
Width of the "strip" around the decision boundary containing no training examples.
A linear SVM is a perceptron for which we choose ${{\mathbf{w}}},w_0$ so that margin is maximized

Distance to the decision boundary

Suppose we have a decision boundary that separates the data.
Let $\gamma_i$ be the distance from instance ${{{\mathbf{x}}_i}}$ to the decision boundary.
How can we write $\gamma_i$ in terms of ${{{\mathbf{x}}_i}}, y_i, {{\mathbf{w}}}, w_0$?

Distance to the decision boundary (II)

${{\mathbf{w}}}$ is normal to the decision boundary. Thus, $\frac{\mathbf{w}}{||{{\mathbf{w}}}||}$ is the unit normal of the boundary.
Vector from B to ${\mathbf{x}}_i$ is $\gamma_i \frac{\mathbf{w}}{||{{\mathbf{w}}}||}$.
B, the point on the boundary nearest ${{{\mathbf{x}}_i}}$, is ${{{\mathbf{x}}_i}}-\gamma_i \frac{\mathbf{w}}{||{{\mathbf{w}}}||}$.
Since B is on the boundary, \[\left({{{\mathbf{x}}_i}}-\gamma_i \frac{\mathbf{w}}{||{{\mathbf{w}}}||}\right)^{\mathsf{T}}{\mathbf{w}}+ w_0 = 0\]
Solving for $\gamma_i$ yields \[\gamma_i = \frac{{\mathbf{x}}_i^{\mathsf{T}}{\mathbf{w}}+ w_0}{||{\mathbf{w}}||}\]
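A small numeric check of this formula, using a hypothetical boundary chosen so that $\|{\mathbf{w}}\| = 5$. The function name `signed_distance` is an assumption for this sketch.

```r
# Hypothetical boundary 3*x1 + 4*x2 - 5 = 0, chosen so that ||w|| = 5.
w  <- c(3, 4)
w0 <- -5

signed_distance <- function(x, w, w0) (sum(x * w) + w0) / sqrt(sum(w^2))

x_i   <- c(3, 4)
gamma <- signed_distance(x_i, w, w0)        # (9 + 16 - 5) / 5

# B = x_i - gamma * w / ||w|| should lie on the boundary.
B <- x_i - gamma * w / sqrt(sum(w^2))
on_boundary <- abs(sum(B * w) + w0) < 1e-12
```

Note that $\gamma_i$ as computed is signed: it is negative for points on the other side of the boundary, which is why it is paired with the label as $y_i\gamma_i$.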

The margin HTF Ch. 4.5, Ch 12

The margin of the hyperplane is $2M$, where $M=\min_i y_i \gamma_i$
The most direct statement of the problem of finding a maximum margin separating hyperplane is thus

\[\max_{{\mathbf{w}},w_0} \min_i y_i \gamma_i \equiv \max_{{\mathbf{w}},w_0} \min_i y_i\frac{{\mathbf{x}}_i^{\mathsf{T}}{\mathbf{w}}+ w_0}{||{\mathbf{w}}||}\]
This turns out to be inconvenient for optimization, however

Treating the $\gamma_i$ as constraints

From the definition of margin, we have: \[M \leq y_i \gamma_i = y_i \frac{{\mathbf{x}}_i^{\mathsf{T}}{\mathbf{w}}+ w_0}{||{\mathbf{w}}||} ~~~~\forall i\]
This suggests:

maximize $M$	with respect to $M, {{\mathbf{w}}}, w_0$
	subject to $M \leq y_i \frac{{\mathbf{x}}_i^{\mathsf{T}}{\mathbf{w}}+ w_0}{\|{\mathbf{w}}\|}$ for all $i$

Problems:
- ${{\mathbf{w}}}$ appears nonlinearly in the constraints.
- This problem is underconstrained. If $({{\mathbf{w}}},w_0,M)$ is an optimal solution, then so is $(\beta{{\mathbf{w}}},\beta w_0,M)$ for any $\beta>0$.

Adding a constraint

Let’s add the constraint that $M = 1 / \|{{\mathbf{w}}}\|$. This fixes the scale of ${{\mathbf{w}}}$ and allows us to rewrite the problem:

maximize $\frac{1}{\|{\mathbf{w}}\|}$	with respect to ${{\mathbf{w}}}, w_0$
	subject to $\frac{1}{\|{\mathbf{w}}\|} \leq y_i \frac{{\mathbf{x}}_i^{\mathsf{T}}{\mathbf{w}}+ w_0}{\|{\mathbf{w}}\|}$ for all $i$

which is the same as

maximize $\frac{1}{\|{\mathbf{w}}\|}$	with respect to ${{\mathbf{w}}}, w_0$
	subject to $1 \le y_i ({\mathbf{x}}_i^{\mathsf{T}}{\mathbf{w}}+ w_0)$ for all $i$

This is really nice because the constraints are now linear in ${{\mathbf{w}}}$ and $w_0$.

Final formulation

Let’s minimize $\|{{\mathbf{w}}}\|^2$ instead of maximizing $\frac{1}{\|{\mathbf{w}}\|}$. (Maximizing $\frac{1}{\|{\mathbf{w}}\|}$ is the same as minimizing $\|{{\mathbf{w}}}\|$, and squaring is a monotone transformation because $\|{{\mathbf{w}}}\|$ is positive, so this doesn’t change the optimal solution.)
This gets us to:

minimize $\|{{\mathbf{w}}}\|^2$ w.r.t. ${{\mathbf{w}}}, w_0$

subject to $y_i({\mathbf{x}}_i^{\mathsf{T}}{\mathbf{w}}+w_0)\geq1$
This we can solve! How?
- It is a convex quadratic programming (QP) problem—a standard type of optimization problem for which many efficient packages are available.

Example

We have a solution, but no “support vectors” yet…

What are "Support Vectors"?

minimize

$\frac{1}{2} \|{{\mathbf{w}}}\|^2$ w.r.t. ${{\mathbf{w}}}, w_0$

subject to

$y_i({\mathbf{x}}_i^{\mathsf{T}}{\mathbf{w}}+w_0)\geq1$

Turns out (HTF Ch. 4.5.2) we can write: \[{\bf w}=\sum_i \alpha_i y_i {\mathbf{x}}_i,~~\mbox{where $\alpha_i \ge 0$}\]
As for the perceptron with zero initial weights, the optimal ${{\mathbf{w}}}$ is a linear combination of the ${{{\mathbf{x}}_i}}$.
The output is therefore:

\[h_{{\mathbf{w}},w_0}({\mathbf{x}}) = \mbox{sign} \left(\sum_{i=1}^n \alpha_i y_i ({{{\mathbf{x}}_i}}\cdot {{{\mathbf{x}}}}) +w_0\right)\]
Output depends on weighted dot product of input vector with training examples

Solving “the dual”

We can actually solve directly for the $\alpha_i$ (again see HTF Ch. 4.5.2): \[\max_{{{\boldsymbol{\alpha}}}} \sum_{i=1}^n \alpha_i -\frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n y_i y_j \alpha_i \alpha_j ({\mathbf{x}}_i \cdot {\mathbf{x}}_j)\] with constraints: $\alpha_i \geq 0 \mbox{ and} \sum_i \alpha_i y_i =0$
This is also a QP

The support vectors

Suppose we find optimal ${{\boldsymbol{\alpha}}}$s (e.g., using a standard QP package)
The $\alpha_i$ will be $>0$ only for the points for which $y_i({\mathbf{x}}_i^{\mathsf{T}}{\mathbf{w}}+ w_0)=1$
These are the points lying on the edge of the margin, and they are called support vectors, because they define the decision boundary
The output of the classifier for query point ${\mathbf{x}}$ is computed as: \[\mbox{sgn}\left[\left(\sum_{i=1}^n \alpha_i y_i ({\mathbf{x}}_i \cdot {\mathbf{x}})\right) + w_0 \right]\] Hence, the output is determined by computing the dot product of the point with the support vectors!
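The offset $w_0$ is not given by the dual directly, but it follows from the condition that defines a support vector. A short derivation, writing $s$ for the index of any one support vector (a label introduced here for illustration):

```latex
\[
y_s({\mathbf{x}}_s^{\mathsf{T}}{\mathbf{w}} + w_0) = 1
\quad\Longrightarrow\quad
w_0 = y_s - {\mathbf{x}}_s^{\mathsf{T}}{\mathbf{w}}
    = y_s - \sum_{i=1}^n \alpha_i y_i ({\mathbf{x}}_i \cdot {\mathbf{x}}_s)
\]
```

The middle step multiplies both sides by $y_s$ and uses $y_s^2 = 1$, since $y_s \in \{-1,+1\}$.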

Example

Support vectors are in bold

But why all this work?

SVMs are a state-of-the-art method for classification when you don’t need probability estimates
Intuitively, the large-margin property makes sense. Theory backs this up.
SVMs offer “off-the-shelf” non-linear classification without having to do explicit feature construction, as we will see.

Soft margin classifiers

Recall that in the linearly separable case, we compute the solution to the following optimization problem:

min $\frac{1}{2}\|{{\mathbf{w}}}\|^2$ w.r.t. ${{\mathbf{w}}}, w_0$

s.t. $y_i({\mathbf{x}}_i^{\mathsf{T}}{\mathbf{w}}+ w_0)\geq1$
What if we can't satisfy the constraints?

Soft margin classifiers

To allow misclassifications, we relax the constraints to: \[y_i({\mathbf{x}}_i^{\mathsf{T}}{\mathbf{w}}+ w_0) \geq 1-\xi_i\]
If $\xi_i \in (0,1)$, the data point is within the margin
If $\xi_i \geq 1$, then the data point is misclassified
We define the soft error as $\sum_i \xi_i$; each $\xi_i$ is a slack variable

Problem formulation with soft errors

Instead of:

min $\frac{1}{2}\|{{\mathbf{w}}}\|^2$ w.r.t. ${{\mathbf{w}}}, w_0$

s.t. $y_i({\mathbf{x}}_i^{\mathsf{T}}{\mathbf{w}}+ w_0)\geq1$

we want to solve:

min $\frac{1}{2}\|{{\mathbf{w}}}\|^2+ C \sum_i \xi_i$ w.r.t. ${{\mathbf{w}}}, w_0, \xi_i$

s.t. $y_i({\mathbf{x}}_i^{\mathsf{T}}{\mathbf{w}}+ w_0)\geq1-\xi_i$, $\xi_i \geq 0$
Note that soft errors include points that are misclassified,
as well as points within the margin
There is a linear penalty for both categories
The choice of the constant $C$ controls boundary-fitting
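The slack variables can be eliminated to show what is being penalized. For fixed ${{\mathbf{w}}}, w_0$, the smallest $\xi_i$ satisfying both constraints is the following, so the problem can be restated without constraints:

```latex
\[
\xi_i = \max\left(0,\; 1 - y_i({\mathbf{x}}_i^{\mathsf{T}}{\mathbf{w}} + w_0)\right)
\]
\[
\min_{{\mathbf{w}}, w_0}\; \frac{1}{2}\|{\mathbf{w}}\|^2
  + C \sum_i \max\left(0,\; 1 - y_i({\mathbf{x}}_i^{\mathsf{T}}{\mathbf{w}} + w_0)\right)
\]
```

Points outside the margin on the correct side contribute zero; points inside the margin or misclassified contribute linearly in how far they violate $y_i({\mathbf{x}}_i^{\mathsf{T}}{\mathbf{w}} + w_0) \geq 1$.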

A built-in boundary-fitting knob

min	$\frac{1}{2}\|{{\mathbf{w}}}\|^2+ C \sum_i \xi_i$
w.r.t.	${{\mathbf{w}}}, w_0, \xi_i$
s.t.	$y_i({\mathbf{x}}_i^{\mathsf{T}}{\mathbf{w}}+w_0) \geq 1-\xi_i$, $\xi_i \geq 0$

If $C$ is $0$, there is no penalty for soft errors, so the focus is on maximizing the margin, even if this means more mistakes
If $C$ is very large, the emphasis on the soft errors will decrease the margin, if this helps to classify more examples correctly.
Internal cross-validation is a good way to choose $C$ appropriately

Dual form for the soft margin problem

Like before, we can formulate a “dual” problem that identifies the support vectors:

	Primal form:
min	$\frac{1}{2}\|{{\mathbf{w}}}\|^2+{\color{MyRed} C\sum_i\xi_i}$ w.r.t. ${{\mathbf{w}}}, w_0, {\color{MyRed}\xi_i}$
s.t.	${{y}}_i({\mathbf{x}}_i^{\mathsf{T}}{\mathbf{w}}+w_0)\geq {\color{MyRed}(1-\xi_i)}$, $\xi_i\geq 0$

	Dual form:
max	$\sum_{i=1}^n\alpha_i-\frac{1}{2}\sum_{i,j=1}^n{{y}}_i{{y}}_j\alpha_i\alpha_j({{{\mathbf{x}}_i}}\cdot{{{\mathbf{x}}_j}})$ w.r.t. $\alpha_i$
s.t.	$0\leq\alpha_i {\color{MyRed}\leq C}$, $\sum_{i=1}^n\alpha_i{{y}}_i=0$

All the previously described machinery can be used to solve this problem

