--- title: "Linear Models for Classification" author: "Dan Lizotte" date: '`r Sys.Date()`' output: beamer_presentation: default html_document: default ioslides_presentation: css: ../my_ioslides.css pdf_document: null --- ```{r echo=F,message=F,warning=F} library(dplyr);library(ggplot2) wpbc_featurenames = c("Radius","Texture","Perimeter","Area","Smoothness","Compactness","Concavity","Concave Points","Symmetry","Fractal Dim") wpbc_names = c("ID","Outcome","Time",paste(wpbc_featurenames,"Mean"),paste(wpbc_featurenames,"SE"),paste(wpbc_featurenames,"Worst"),"Tumor Diameter","Lymph Status") bc <- read.csv("~dlizotte/Seafile/My Library/Teaching/cs4437/Lectures/6_Linear Models/data/wpbc.data.txt",header=FALSE,col.names=wpbc_names) ``` \newcommand{\T}{\mathsf{T}} \newcommand{\x}{\mathbf{x}} \newcommand{\X}{\mathbf{X}} \newcommand{\w}{\mathbf{w}} ## Linear models in general HTF Ch. 2.8.3 - By linear models, we mean that the hypothesis function $h_{\bf w}({\bf x})$ is a *linear function of the parameters ${\bf w}$*. - Predictions are a *linear combination of feature values* - $$h_{\bf w}(\x) = \sum_{k=0}^{p} w_k \phi_k(\x) = {{\boldsymbol{\phi}}}(\x)^\T {{\mathbf{w}}}$$ where $\phi_k$ are called *basis functions* As usual, we let $\phi_0(\x)=1, \forall \x$, to create a bias. - To recover degree-$d$ polynomial regression in one variable, set $$\phi_0(x) = 1, \phi_1(x) = x, \phi_2(x) = x^2, ..., \phi_d(x) = x^d$$ - Basis functions are *fixed* for training ## Linear Methods for Classification - Classification tasks - Error functions for classification - Logistic Regression - Support Vector Machines ## Example: Given nucleus radius, predict cancer recurrence {.smaller} ```{r} ggplot(bc,aes(Radius.Mean,fill=Outcome,color=Outcome)) + geom_density(alpha=I(1/2)) ``` ## Example: Solution by linear regression - Univariate real input: nucleus size - Output coding: non-recurrence = 0, recurrence = 1 - **Sum squared error** minimized by the blue line ```{r echo=F} bc <- bc %>% mutate(binOutcome = as.numeric(Outcome == "R")) ggplot(bc,aes(x=Radius.Mean,y=binOutcome)) + geom_point(aes(colour=Outcome)) + geom_smooth(method="lm",se=F) ``` ## Linear regression for classification - The predictor shows an increasing trend towards recurrence with larger nucleus size, as expected. - Output *cannot be directly interpreted* as a class prediction. - Thresholding output (e.g., at 0.5) could be used to predict 0 or 1.\ (In this case, prediction would be 0 except for extremely large nucleus size.) - Interpret as probability? Not bounded to $[0,1]$, not consistent even for well-separated data ## Probabilistic view - Suppose we have two possible classes: $y\in \{0,1\}$. - The symbols “$0$” and “$1$” are unimportant. Could have been $\{a,b\}$, $\{\mathit{up},\mathit{down}\}$, whatever. - Rather than try to predict the class label directly, ask:\ What is the *probability* that a given input $\x$ to has class $y=1$? - Conditional Probability: $$P(y=1|\X = \x) = \frac{P(\X = \x, y=1)}{P(\X = \x)} $$ - Bayes' Rule $$ = \frac{P(\X = \x| y=1)P(y=1)}{P(\X = \x|y=1)P(y=1)+P(\X = \x|y=0)P(y=0)} $$ ## Probabilistic models for binary classification {.smaller} - Can also write: $$P(y=1|\X = \x)=\sigma\left(\log\frac{P(y=1|\X = \x)}{P(y=0|\X = \x)}\right) = \sigma\left(\log\frac{P(\X = \x|y=1)P(y=1)}{P(\X = \x|y=0)P(y=0)}\right)$$ where $\sigma(a) = \frac{1}{1+\exp(-a)}$, the *sigmoid* or *logistic function*. 
## Probabilistic models for binary classification {.smaller}

- Can also write: $$P(y=1|\X = \x)=\sigma\left(\log\frac{P(y=1|\X = \x)}{P(y=0|\X = \x)}\right) = \sigma\left(\log\frac{P(\X = \x|y=1)P(y=1)}{P(\X = \x|y=0)P(y=0)}\right)$$ where $\sigma(a) = \frac{1}{1+\exp(-a)}$ is the *sigmoid* or *logistic function*.
- Discriminative Learning:
    - Model $\log\frac{P(y=1|\X = \x)}{P(y=0|\X = \x)}$ (the *log odds*) as a function of $\mathbf{x}$
    - Only models how to *discriminate* between examples of the two classes. Does not model the distribution of $\mathbf{x}$.
- Generative Learning:
    - Model $P(y=1), P(y=0), P(\X = \x|y=1), P(\X = \x|y=0)$, then use the rightmost formula above
    - Models the full joint distribution; can actually use the model to *generate* (i.e., fantasize) data

## Logistic regression

HTF Ch. 4.4

- Represent the hypothesis as a logistic function of a linear combination of inputs: $$h(\x) = \sigma(\x^\T \w)$$
- Interpret $h(\x)$ as $P(y=1|\X = \x)$; interpret $\x^\T \w$ as the log odds
- How do we choose ${\bf w}$?
- In the probabilistic framework, observing $\langle \x_i , 1 \rangle$ does not mean $h(\x_i)$ should be as close to $1$ as possible.
- **Maximize the probability the model assigns to the $y_i$, *given the* $\x_i$**.

## Max Conditional Likelihood {.smaller}

- **Maximize the probability the model assigns to the $y_i$, *given the* $\x_i$**.
- Assumption 1: Examples are i.i.d., so the probability of observing all the $y$s is the product $$\begin{gathered} P(Y_1=y_1, Y_2=y_2, ..., Y_n = y_n|X_1 = \x_1, X_2 = \x_2, ..., X_n = \x_n) \\ = \prod_{i=1}^n P(Y_i = y_i | X_i = \x_i)\end{gathered}$$
- Assumption 2: $$\begin{aligned} P(y = 1|\X = \x) & = h_\w(\x) = 1 / (1 + \exp(-\x^\T \w))\\ P(y = 0|\X = \x) & = (1 - h_\w(\x))\\\end{aligned}$$

## Max Conditional Likelihood {.smaller}

- **Maximize the probability the model assigns to the $y_i$, *given the* $\x_i$**.
- It is more stable to maximize the log probability. Note $$\begin{aligned} \log P(Y_i = y_i | X_i = \x_i) & = \left\{ \begin{array}{ll} \log h_\w(\x_i) & \mbox{if}~y_i=1 \\ \log(1-h_\w(\x_i)) & \mbox{if}~y_i=0 \end{array} \right. \end{aligned} $$
- Therefore, $$\log \prod_{i=1}^n P(Y_i = y_i | X_i = \x_i) = \sum_{i = 1}^n \left[y_i \log( h_\w(\x_i)) + (1 - y_i) \log (1 - h_\w(\x_i))\right] $$
- This suggests the error $$\begin{aligned} \hspace{-2em} J(h_{\w}) = - \sum_{i = 1}^n \left[y_i \log( h_\w(\x_i)) + (1 - y_i) \log (1 - h_\w(\x_i))\right]\end{aligned}$$
- This is the *cross entropy*: the number of bits needed to transmit the $y_i$ if both parties know $h_\w$ and $\x_i$.
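
## Cross entropy in code

- A minimal sketch: minimize the cross-entropy error $J$ directly with `optim()` and check that it recovers (essentially) the same coefficients that `glm()` reports on the next slide. It reuses the `bc` data frame and the 0/1 outcome coding defined earlier.

```{r}
bc <- bc %>% mutate(binOutcome = as.numeric(Outcome == "R"))
X <- cbind(1, bc$Radius.Mean)   # design matrix with a bias column
y <- bc$binOutcome
sigmoid <- function(a) 1 / (1 + exp(-a))
J <- function(w) {              # negative conditional log-likelihood (cross entropy)
  h <- sigmoid(X %*% w)
  h <- pmin(pmax(h, 1e-12), 1 - 1e-12)   # guard against log(0)
  -sum(y * log(h) + (1 - y) * log(1 - h))
}
optim(c(0, 0), J, method = "BFGS")$par   # compare with glm's coefficients
```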
## Back to the breast cancer problem

```{r echo=F}
ggplot(bc,aes(x=Radius.Mean,y=binOutcome)) + geom_point(aes(colour=Outcome)) +
  stat_smooth(method="glm",method.args=list(family="binomial"),se=F,colour="blue") +
  stat_smooth(method="lm",se=F,linetype=6,colour="red")
```
Logistic Regression:

```{r echo=F}
bclr <- glm(Outcome ~ Radius.Mean, data=bc, family="binomial")
bclr$coefficients
```

Least Squares:

```{r echo=F}
bclm <- lm(binOutcome ~ Radius.Mean, data=bc)
bclm$coefficients
```
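
## Using the fitted models to predict

- A small illustration with a *hypothetical* input value: the logistic model returns a probability of recurrence, while the least-squares fit returns an unbounded score.

```{r}
new_obs <- data.frame(Radius.Mean = 20)               # a hypothetical nucleus radius
predict(bclr, newdata = new_obs, type = "response")   # estimated P(recurrence | radius)
predict(bclm, newdata = new_obs)                      # linear regression output
```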
## Supervised Learning Methods: “Objective-driven”

-----------------------------------------------------------------------------------------------------------------------------
Mthd.   Form                                                  Objective
------- ----------------------------------------------------- ---------------------------------------------------------------
OLS     $h_w(\x) = \x^\T \w$                                  $\sum_{i=1}^n (h_\w(\x_i) - y_i)^2$
        $\approx E[Y\,|\,\mathbf{X}=\x]$...
        ...using a linear function

LR      $h_w(\x) = \frac{1}{1 + \mathrm{e}^{-\x^\T \w}}$      $-\sum_{i=1}^n y_i \log h_\w(\x_i) + (1-y_i) \log (1-h_\w(\x_i))$
        $\approx P(Y=y|\mathbf{X}=\x)$...
        ...using sigmoid of a linear function
-----------------------------------------------------------------------------------------------------------------------------

- Both model the ***conditional mean of $y$*** using a (transformed) ***linear function***
- Both use ***maximum conditional likelihood*** to estimate $\w$

## Decision boundary

HTF Ch. 2.3.1, 2.3.2

- How complicated is a classifier?
- One way to think about it is in terms of its *decision boundary*, i.e., the surface it draws to separate examples
- *Linear classifiers* draw a hyperplane between examples of the different classes. *Non-linear classifiers* draw more complicated surfaces between the different classes.
- For a probabilistic classifier with a cutoff of 0.5,\
the decision boundary is the curve on which: $$\frac{P(y=1|\X = \x)}{P(y=0|\X = \x)} = 1, \mbox{i.e., where } \log\frac{P(y=1|\X = \x)}{P(y=0|\X = \x)} = 0$$

\renewcommand{\Pr}{\mathrm{Pr}}

## Decision boundary

Class = R if $\Pr(Y=1|X=x) > 0.5$

```{r echo=F}
bc <- bc %>% mutate(binOutcome = as.numeric(Outcome == "R"))
bclr <- glm(Outcome ~ Radius.Mean, data=bc, family="binomial")
db <- -bclr$coefficients[1]/bclr$coefficients[2]
ggplot(bc,aes(x=Radius.Mean,y=binOutcome)) + geom_point(aes(colour=Outcome)) +
  stat_smooth(method="glm",method.args=list(family="binomial"),se=F) +
  stat_smooth(method="lm",se=F,linetype=6,color="red") +
  geom_vline(xintercept=db,color="magenta")
```

## Decision boundary

Class = R if $\Pr(Y=1|X=x) > 0.25$

```{r echo=F,warning=F,message=F}
library(boot)
bc <- bc %>% mutate(binOutcome = as.numeric(Outcome == "R"))
bclr <- glm(Outcome ~ Radius.Mean, data=bc, family="binomial")
db <- (logit(0.25) - bclr$coefficients[1])/bclr$coefficients[2]
ggplot(bc,aes(x=Radius.Mean,y=binOutcome)) + geom_point(aes(colour=Outcome)) +
  stat_smooth(method="glm",formula=y~x,method.args=list(family="binomial"),se=F) +
  stat_smooth(method="lm",se=F,linetype=6,color="red") +
  geom_vline(xintercept=db,color="magenta")
```

## Decision boundary

Class = R if $\Pr(Y=1|X=x) > 0.5$

```{r echo=F,warning=F,message=F}
library(boot)
bc <- bc %>% mutate(binOutcome = as.numeric(Outcome == "R"))
bclr <- glm(Outcome ~ Radius.Mean + Compactness.Mean, data=bc, family="binomial")
int <- (logit(0.5) - bclr$coefficients[1])/bclr$coefficients[3]
slp <- -bclr$coefficients[2]/bclr$coefficients[3]
ggplot(bc,aes(x=Radius.Mean,y=Compactness.Mean,color=Outcome)) + geom_point(size=3) +
  geom_abline(intercept=int,slope=slp,color="magenta")
```

## Decision boundary

Class = R if $\Pr(Y=1|X=x) > 0.25$

```{r echo=F,warning=F,message=F}
library(boot)
bc <- bc %>% mutate(binOutcome = as.numeric(Outcome == "R"))
bclr <- glm(Outcome ~ Radius.Mean + Compactness.Mean, data=bc, family="binomial")
int <- (logit(0.25) - bclr$coefficients[1])/bclr$coefficients[3]
slp <- -bclr$coefficients[2]/bclr$coefficients[3]
ggplot(bc,aes(x=Radius.Mean,y=Compactness.Mean,color=Outcome)) + geom_point(size=3) +
  geom_abline(intercept=int,slope=slp,color="magenta")
```
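
## How the boundary line is computed

- The magenta line in the two-feature plots comes from setting the log odds equal to $\mathrm{logit}(p)$ and solving for one feature in terms of the other; the (hidden) plotting code does exactly this. Shown here with the threshold $p = 0.25$:

```{r message=F,warning=F}
library(boot)   # for logit()
bclr <- glm(Outcome ~ Radius.Mean + Compactness.Mean, data=bc, family="binomial")
p <- 0.25                                                          # classification threshold
# Solve w0 + w1*Radius + w2*Compactness = logit(p) for Compactness:
int <- (logit(p) - bclr$coefficients[1]) / bclr$coefficients[3]    # intercept of the line
slp <- -bclr$coefficients[2] / bclr$coefficients[3]                # slope of the line
c(intercept = unname(int), slope = unname(slp))
```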
## Supervised Learning Methods: “Objective-driven”

-----------------------------------------------------------------------------------------------------------------------------
Mthd.   Form                                                  Objective
------- ----------------------------------------------------- ---------------------------------------------------------------
OLS     $h_w(\x) = \x^\T \w$                                  $\sum_{i=1}^n(h_\w(\x_i) - y_i)^2$
        $\approx E[Y\,|\,\mathbf{X}=\x]$...
        ...using a linear function

LR      $h_\w(\x) = \frac{1}{1 + \mathrm{e}^{-\x^\T \w}}$     $-\sum_{i=1}^n y_i \log h_\w (\x_i) + (1-y_i) \log (1-h_\w (\x_i))$
        $\approx P(Y=y|\mathbf{X}=\x)$...
        ...using sigmoid of a linear function

SVM     $h_\w(\x) = \mathrm{sgn}(\x^\T \w)$
        ...using a linear function
-----------------------------------------------------------------------------------------------------------------------------

## Large Margin Classifiers:\ Linear Support Vector Machines

- Linear classifiers that focus on learning the *decision boundary* rather than the conditional distribution $P(Y=y|\mathbf{X}=\x)$
- Perceptrons
    - Definition
    - Perceptron learning rule
    - Convergence
- “Margin” idea and max margin classifiers
- (Linear) support vector machines
    - Formulation as an optimization problem

## Marvin Minsky, 1927-2016

## Perceptrons

HTF Ch. 4.5

- Consider a binary classification problem with data $\{{{\x_i}},y_i\}_{i=1}^n$, $y_i\in\{-1,+1\}$. **Note the coding of $y_i$**.
- A *perceptron* (Rosenblatt, 1957) is a classifier of the form: $$h_{{{\mathbf{w}}},w_0}({{\x}}) = \mbox{sign}(\x^\T \w + w_0) = \left\{ \begin{array}{ll} +1 & \mathrm{if}~ \x^\T \w + w_0\geq 0 \\ -1 & \mathrm{otherwise} \end{array} \right.$$ Here, ${{\mathbf{w}}}$ is a vector of weights, and $w_0$ is a constant offset. (**Note that $x_0 = 1$ is omitted.**)
- The decision boundary is $\x^\T \w + w_0 = 0$.
- Perceptrons output a **class**, not a probability
- An example $( {{\x}}, y )$ is classified correctly if: $$y \cdot (\x^\T \w + w_0) > 0$$

## Linear separability

- The data set is *linearly separable* if and only if there exist ${{\mathbf{w}}}$, $w_0$ such that:
    - For all $i$, $y_i(\x_i^\T \w +w_0)>0$.
    - Or equivalently, the 0-1 loss $\sum_i \mathbf{1}_{y_i(\x_i^\T \w +w_0) < 0}$ is zero for some set of parameters $({\bf w}, w_0)$.

## Linear Separability

## The Perceptron Learning Rule

- Consider the following procedure (a minimal implementation appears two slides ahead):
    1. Initialize ${{\mathbf{w}}}$ and $w_0$ randomly
    2. While any training examples remain incorrectly classified
        1. Loop through all misclassified examples
        2. For misclassified example $i$, perform the updates: $${{\mathbf{w}}}\gets {{\mathbf{w}}}+ \delta y_i{{\x}}_i,~~~~~w_0\gets w_0 + \delta y_i$$ where $\delta$ is a step-size parameter.
- The update equation, or sometimes the whole procedure, is called the *perceptron learning rule*.
- Intuition: For positive examples misclassified as negative, change ${{\mathbf{w}}}$ to increase $\x_i^\T \w +w_0$, and vice versa

## Error Minimization Interpretation

- The PLR can be interpreted as gradient descent on the following function: $${{J}}({{\mathbf{w}}},w_0) = \sum_{i=1}^n \left\{ \begin{array}{ll} 0 & \mathrm{if}~ y_i(\x_i^\T \w + w_0)\geq 0 \\ -y_i(\x_i^\T \w + w_0) & \mathrm{if}~ y_i(\x_i^\T \w + w_0)<0 \end{array}\right.$$
- For correctly classified examples, the error is zero.
- For incorrectly classified examples, the error is by how much $\x_i^\T \w +w_0$ is on the wrong side of the decision boundary.
- $J$ is piecewise linear, so it has a gradient almost everywhere; stochastic gradient descent on $J$ gives the perceptron learning rule.
- $J$ is zero if and only if all examples are classified correctly – just like the 0-1 loss function.
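
## The perceptron learning rule in code

- A minimal sketch on simulated, linearly separable data (step size $\delta = 1$, weights initialized to zero); `alpha` records how much each example contributes, which reappears in the dual representation a few slides ahead.

```{r}
set.seed(1)
X <- matrix(runif(40), ncol = 2)               # 20 simulated 2-D examples
y <- ifelse(X[, 1] + X[, 2] > 1, +1, -1)       # linearly separable labels
w <- c(0, 0); w0 <- 0; delta <- 1
alpha <- numeric(nrow(X))                      # per-example sum of step sizes
repeat {
  mistakes <- which(y * (X %*% w + w0) <= 0)   # currently misclassified examples
  if (length(mistakes) == 0) break             # converged: all classified correctly
  i <- mistakes[1]
  w  <- w  + delta * y[i] * X[i, ]             # perceptron update
  w0 <- w0 + delta * y[i]
  alpha[i] <- alpha[i] + delta
}
c(w, w0)                                         # a separating hyperplane
all.equal(as.vector(w), colSums(alpha * y * X))  # w is a combination of the x_i
```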
## Perceptron convergence theorem

- ***If*** the classes are linearly separable, ***then*** the perceptron learning rule will find a separator after some finite number of updates.
- The number of updates depends on the data set, and also on the step size parameter.
- If the classes are not linearly separable, there will be oscillation (which can be detected automatically).

```{r results='asis',fig.height=6,echo=F}
n <- 20
set.seed(1)
train <- data.frame(x1 = runif(n), x2 = runif(n))
train$y <- (train$x1 + train$x2 > 1)*2 - 1
w <- c(1,1)
w0 <- 1
slp <- -w[1]/w[2]
int <- -w0/w[2]
mistakes = T
plts <- list()
pl <- 1
while(mistakes) {
  mistakes = F
  for (i in 1:n) {
    xv <- as.numeric(train[i,c(1,2)])
    y <- as.numeric(train[i,'y'])
    if ((sum(w*xv) + w0)*y <= 0) {
      # Record a plot of the current boundary before updating on example i
      train$yhat <- factor(sign((as.matrix(train[,c("x1","x2")]) %*% w) + w0), levels = c(1,-1))
      .e <- environment()
      plts[[pl]] <- ggplot(train,aes(x=x1,y=x2,color=factor(y, levels = c(1,-1)),shape=yhat),environment=.e) +
        annotate("point",x=xv[1],y=xv[2],size=3,color="red") + geom_point() +
        geom_abline(slope=slp,intercept=int) +
        ggtitle(sprintf("Update: %.1f w: [%.3f,%.3f], w0: %.3f",pl/2,w[1],w[2],w0)) +
        xlim(-0.5,1.5) + ylim(-0.5,1.5) + scale_shape_discrete(drop = FALSE)
      pl <- pl + 1
      # Perceptron update on the misclassified example
      w <- w + y*xv
      w0 <- w0 + y
      mistakes = T
      slp <- -w[1]/w[2]
      int <- -w0/w[2]
      # Record a plot of the boundary after the update
      train$yhat <- factor(sign((as.matrix(train[,c("x1","x2")]) %*% w) + w0), levels = c(1,-1))
      .e <- environment()
      plts[[pl]] <- ggplot(train,aes(x=x1,y=x2,color=factor(y, levels=c(1,-1)),shape=yhat),environment=.e) +
        annotate("point",x=xv[1],y=xv[2],size=3,color="red") + geom_point() +
        geom_abline(slope=slp,intercept=int) +
        ggtitle(sprintf("Update: %.1f w: [%.3f,%.3f], w0: %.3f",pl/2,w[1],w[2],w0)) +
        xlim(-0.5,1.5) + ylim(-0.5,1.5) + scale_shape_discrete(drop = FALSE)
      pl <- pl + 1
    }
  }
}
# Emit one slide per recorded plot
for ( pl in 1:length(plts) ) {
  writeLines("## Perceptron Learning Example\n\n")
  plot(plts[[pl]])
  writeLines("\n\n")
}
```

## Weight as a combination of input vectors

- Recall the perceptron learning rule: $${{\mathbf{w}}}\gets {{\mathbf{w}}}+ \delta y_i{{\x}}_i,~~~~~w_0\gets w_0 + \delta y_i$$
- If the initial weights are zero, then at any step the *weights are a linear combination of the feature vectors of the examples*: $${{\mathbf{w}}}= \sum_{i=1}^n \alpha_i y_i {{\x_i}},~~~~~w_0 =\sum_{i=1}^n \alpha_i y_i$$ where $\alpha_i$ is the sum of the step sizes used for all updates based on example $i$.
- This is called the *dual representation* of the classifier.
- Even by the end of training, some examples may never have participated in an update (they were always correct), so the corresponding $\alpha_i=0$.

## Examples used (bold) and not used (faint) in updates {.smaller}

## Comment: Solutions are nonunique

## Perceptron summary

- Perceptrons can be trained to fit linearly separable data, using a gradient descent rule.
- Blindingly fast
- Solutions are non-unique

## Support Vector Machines

- Support vector machines (SVMs) for binary classification can be viewed as a way of training perceptrons
- Three main new ideas:
    - An optimization criterion (the “margin”) guarantees uniqueness and has theoretical advantages
    - Natural handling of nonseparable data by allowing mistakes
    - An efficient way of operating in expanded feature spaces: the “kernel trick”
- SVMs can also be used for multiclass classification and regression.
## Returning to the non-uniqueness issue

- Consider a linearly separable binary classification data set
- There are infinitely many hyperplanes that separate the classes:
- Which plane is best?
- For a given plane, for which points should we be most confident in the classification?

## The margin, and linear SVMs

- For a given separating hyperplane, the *margin* is two times the (Euclidean) distance from the hyperplane to the nearest training example.
- It is the width of the “strip” around the decision boundary containing no training examples.
- A linear SVM is a perceptron for which we choose ${{\mathbf{w}}},w_0$ so that the margin is maximized

## Distance to the decision boundary

- Suppose we have a decision boundary that separates the data.
- Let $\gamma_i$ be the distance from instance ${{\x_i}}$ to the decision boundary.
- How can we write $\gamma_i$ in terms of ${{\x_i}}, y_i, {{\mathbf{w}}}, w_0$?

## Distance to the decision boundary (II)
- ${{\mathbf{w}}}$ is normal to the decision boundary. Thus, $\frac{\w}{||\w||}$ is the unit normal of the boundary.
- Let B be the point on the boundary nearest ${{\x_i}}$; the vector from B to ${{\x_i}}$ is $\gamma_i \frac{\w}{||\w||}$.
- Therefore B is ${{\x_i}}-\gamma_i \frac{\w}{||\w||}$.
- Since B is on the boundary, $$\left({{\x_i}}-\gamma_i \frac{\w}{||\w||}\right)^\T \w + w_0 = 0$$
- Solving for $\gamma_i$ yields $$\gamma_i = \frac{\x_i^\T \w + w_0}{||\w||}$$
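
## Checking the distance formula

- A quick numeric sanity check with made-up values of $\w$, $w_0$, and $\x$: the point B recovered from $\gamma$ lies on the boundary, and its distance to $\x$ equals $|\gamma|$.

```{r}
w <- c(3, 4); w0 <- -5; x <- c(2, 1)            # made-up hyperplane and point
gamma <- (sum(x * w) + w0) / sqrt(sum(w^2))     # signed distance to the boundary
B <- x - gamma * w / sqrt(sum(w^2))             # nearest point on the boundary
c(on_boundary = sum(B * w) + w0,                # should be 0
  distance    = sqrt(sum((x - B)^2)),           # should equal |gamma|
  gamma       = gamma)
```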
## The margin

HTF Ch. 4.5, Ch. 12

- The *margin of the hyperplane* is $2M$, where $M=\min_i y_i \gamma_i$
- The most direct statement of the problem of finding a maximum margin separating hyperplane is thus $$\max_{\w ,w_0} \min_i y_i \gamma_i \equiv \max_{\w ,w_0} \min_i y_i\frac{\x_i^\T \w + w_0}{||\w||}$$
- This turns out to be inconvenient for optimization, however

## Treating the $\gamma_i$ as constraints

- From the definition of the margin, we have: $$M \leq y_i \gamma_i = y_i \frac{\x_i^\T \w + w_0}{||\w||} ~~~~\forall i$$
- This suggests:

-----------------  -----------------------------------------------------------------------------
maximize           $M$
with respect to    $M, {{\mathbf{w}}}, w_0$
subject to         $M \leq y_i \frac{\x_i^\T \w + w_0}{||\w||}$ for all $i$
-----------------  -----------------------------------------------------------------------------

- Problems:
    - ${{\mathbf{w}}}$ appears nonlinearly in the constraints.
    - This problem is underconstrained. If $({{\mathbf{w}}},w_0,M)$ is an optimal solution, then so is $(\beta{{\mathbf{w}}},\beta w_0,M)$ for any $\beta>0$.

## Adding a constraint

Let’s add the constraint that $M = 1 / \|{{\mathbf{w}}}\|$:

- This allows us to rewrite the objective function:
- This is really nice because the constraints are linear.

-----------------  -----------------------------------------------------------------------------
maximize           $\frac{1}{||\w||}$
with respect to    ${{\mathbf{w}}}, w_0$
subject to         $\frac{1}{||\w||} \leq y_i \frac{\x_i^\T \w + w_0}{||\w||}$ for all $i$
-----------------  -----------------------------------------------------------------------------

which is the same as

-----------------  -----------------------------------------------------------------------------
maximize           $\frac{1}{||\w||}$
with respect to    ${{\mathbf{w}}}, w_0$
subject to         $1 \le y_i (\x_i^\T \w + w_0)$ for all $i$
-----------------  -----------------------------------------------------------------------------

## Final formulation

- Let’s minimize $\|{{\mathbf{w}}}\|^2$ instead of maximizing $\frac{1}{||\w ||}$. (Taking the square is a monotone transformation, as $\|{{\mathbf{w}}}\|$ is positive, so this doesn’t change the optimal solution.)
- This gets us to:

-----------------  -----------------------------------------------------------------------------
minimize           $\|{{\mathbf{w}}}\|^2$
w.r.t.             ${{\mathbf{w}}}, w_0$
subject to         $y_i(\x_i^\T \w +w_0)\geq1$
-----------------  -----------------------------------------------------------------------------

- This we can solve! How?
- It is a convex *quadratic programming* (QP) problem, a standard type of optimization problem for which many efficient packages are available.

## Example

We have a solution, but no “support vectors” yet...

## What are "Support Vectors"?

-----------------  -----------------------------------------------------------------------------
minimize           $\frac{1}{2} \|{{\mathbf{w}}}\|^2$
w.r.t.             ${{\mathbf{w}}}, w_0$
subject to         $y_i(\x_i^\T \w +w_0)\geq1$
-----------------  -----------------------------------------------------------------------------

- It turns out (HTF Ch. 4.5.2) that we can write: $${\bf w}=\sum_i \alpha_i y_i \x_i,~~\mbox{where $\alpha_i \ge 0$}$$
- As for the perceptron with zero initial weights, the optimal solution for ${{\mathbf{w}}}$ and $w_0$ is a linear combination of the ${{\x_i}}$.
- The output is therefore: $$h_{\w,w_0}(\x) = \mbox{sign} \left(\sum_{i=1}^n \alpha_i y_i ({{\x_i}}\cdot {{\x}}) +w_0\right)$$
- The output depends on a weighted sum of dot products of the input vector with the training examples
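
## Fitting a linear SVM in R

- A minimal sketch on simulated separable data, *assuming the `e1071` package* (an R interface to LIBSVM) *is installed*; a very large cost approximates the hard-margin problem. The weight vector is recovered as a combination of the support vectors, as on the previous slide.

```{r message=F,warning=F}
library(e1071)                                         # assumed to be installed
set.seed(2)
Xs <- matrix(runif(40), ncol = 2)                      # 20 simulated 2-D examples
ys <- factor(ifelse(Xs[, 1] + Xs[, 2] > 1, +1, -1))    # separable labels
fit <- svm(Xs, ys, kernel = "linear", cost = 1e4, scale = FALSE)
fit$index                                              # which examples are support vectors
w_hat  <- as.vector(t(fit$coefs) %*% fit$SV)           # w = sum_i alpha_i y_i x_i
w0_hat <- -fit$rho                                     # the offset w_0
c(w_hat, w0_hat)
```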
## Solving “the dual”

- We can actually solve directly for the $\alpha_i$ (again see HTF Ch. 4.5.2): $$\max_{{{\boldsymbol{\alpha}}}} \sum_{i=1}^n \alpha_i -\frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n y_i y_j \alpha_i \alpha_j (\x_i \cdot \x_j)$$ with constraints: $\alpha_i \geq 0 \mbox{ and } \sum_i \alpha_i y_i =0$
- This is also a QP

## The support vectors

- Suppose we find the optimal ${{\boldsymbol{\alpha}}}$s (e.g., using a standard QP package)
- The $\alpha_i$ will be $>0$ only for the points for which $y_i(\x_i^\T \w + w_0)=1$
- These are the points lying on the edge of the margin, and they are called *support vectors*, because they define the decision boundary
- The output of the classifier for a query point $\x$ is computed as: $$\mbox{sgn}\left[\left(\sum_{i=1}^n \alpha_i y_i (\x_i \cdot \x)\right) + w_0 \right]$$ Hence, the output is determined by computing the *dot products of the query point with the support vectors*!

## Example

Support vectors are in bold

## But why all this work?

- SVMs are state-of-the-art for classification when you don’t need probability estimates
- Intuitively, the large-margin property makes sense. Theory backs this up.
- SVMs offer “off-the-shelf” *non*-linear classification without having to do explicit feature construction, as we will see.

## Soft margin classifiers

- Recall that in the linearly separable case, we compute the solution to the following optimization problem:

-----------------  -----------------------------------------------------------------------------
min                $\frac{1}{2}\|{{\mathbf{w}}}\|^2$
w.r.t.             ${{\mathbf{w}}}, w_0$
s.t.               $y_i(\x_i^\T \w + w_0)\geq1$
-----------------  -----------------------------------------------------------------------------

- What if we can't satisfy the constraints?

## Soft margin classifiers

- To allow misclassifications, we relax the constraints to: $$y_i(\x_i^\T \w + w_0) \geq 1-\xi_i$$
- If $\xi_i \in (0,1)$, the data point is within the margin
- If $\xi_i \geq 1$, then the data point is misclassified
- We define the *soft error* as $\sum_i \xi_i$; each $\xi_i$ is a *slack variable*

## Problem formulation with soft errors

- Instead of:

-----------------  -----------------------------------------------------------------------------
min                $\frac{1}{2}\|{{\mathbf{w}}}\|^2$
w.r.t.             ${{\mathbf{w}}}, w_0$
s.t.               $y_i(\x_i^\T \w + w_0)\geq1$
-----------------  -----------------------------------------------------------------------------

we want to solve:

-----------------  -----------------------------------------------------------------------------
min                $\frac{1}{2}\|{{\mathbf{w}}}\|^2+ C \sum_i \xi_i$
w.r.t.             ${{\mathbf{w}}}, w_0, \xi_i$
s.t.               $y_i(\x_i^\T \w + w_0)\geq1-\xi_i$, $\xi_i \geq 0$
-----------------  -----------------------------------------------------------------------------

- Note that soft errors include points that are misclassified,\
as well as points within the margin
- There is a linear penalty for both categories
- The choice of the *constant $C$ controls boundary-fitting*

## A built-in boundary-fitting knob

-----------------  -----------------------------------------------------------------------------
min                $\frac{1}{2}\|{{\mathbf{w}}}\|^2+ C \sum_i \xi_i$
w.r.t.             ${{\mathbf{w}}}, w_0, \xi_i$
s.t.               $y_i(\x_i^\T \w +w_0) \geq 1-\xi_i$, $\xi_i \geq 0$
-----------------  -----------------------------------------------------------------------------

- If $C$ is very small, there is almost no penalty for soft errors, so the focus is on maximizing the margin, even if this means more mistakes
- If $C$ is very large, the emphasis on the soft errors will decrease the margin, if this helps to classify more examples correctly.
- Internal cross-validation is a good way to choose $C$ appropriately
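
## Choosing $C$ in practice

- A minimal sketch of choosing the soft-margin constant $C$ by internal cross-validation on the breast cancer data, again *assuming the `e1071` package is installed* (its `tune()` function uses 10-fold cross-validation by default).

```{r message=F,warning=F}
library(e1071)
cv <- tune(svm, factor(Outcome) ~ Radius.Mean + Compactness.Mean, data = bc,
           kernel = "linear", ranges = list(cost = 10^(-2:2)))
cv$best.parameters    # the value of C with the lowest cross-validated error
```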
## Dual form for the soft margin problem

- Like before, we can formulate a “dual” problem that identifies the support vectors:

Primal form:

-----------------  ---------------------------------------------------------------------------------------------------------
min                $\frac{1}{2}\|{{\mathbf{w}}}\|^2+{\color{red}{C\sum_i\xi_i}}$
w.r.t.             ${{\mathbf{w}}}, w_0, {\color{red}{\xi_i}}$
s.t.               ${{y}}_i(\x_i^\T \w +w_0)\geq {\color{red}{(1-\xi_i)}}$, $\xi_i\geq 0$
-----------------  ---------------------------------------------------------------------------------------------------------

Dual form:

-----------------  ---------------------------------------------------------------------------------------------------------
max                $\sum_{i=1}^n\alpha_i-\frac{1}{2}\sum_{i,j=1}^n{{y}}_i{{y}}_j\alpha_i\alpha_j({{\x_i}}\cdot{{\x_j}})$
w.r.t.             $\alpha_i$
s.t.               $0\leq\alpha_i {\color{red}{\leq C}}$, $\sum_{i=1}^n\alpha_i{{y}}_i=0$
-----------------  ---------------------------------------------------------------------------------------------------------

- All of the previously described machinery can be used to solve this problem
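
## The dual constraints in a fitted model

- A small check, reusing the cross-validated cost from the previous sketch (and again assuming `e1071`): the fitted `coefs` store $y_i\alpha_i$ for the support vectors, so the box constraint $0\leq\alpha_i\leq C$ means their absolute values never exceed the chosen cost.

```{r message=F,warning=F}
best <- svm(factor(Outcome) ~ Radius.Mean + Compactness.Mean, data = bc,
            kernel = "linear", cost = cv$best.parameters$cost)
range(abs(best$coefs))    # bounded above by the chosen C
```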