2017-02-09

## Linear models in general HTF Ch. 2.8.3

• By linear models, we mean that the hypothesis function $$h_{\bf w}({\bf x})$$ is a linear function of the parameters $${\bf w}$$.

• Predictions are a linear combination of feature values

• $h_{\bf w}({\mathbf{x}}) = \sum_{k=0}^{p} w_k \phi_k({\mathbf{x}}) = {{\boldsymbol{\phi}}}({\mathbf{x}})^{\mathsf{T}}{{\mathbf{w}}}$, where the $$\phi_k$$ are called basis functions. As usual, we let $$\phi_0({\mathbf{x}})=1, \forall {\mathbf{x}}$$, to create a bias term.

• To recover degree-$$d$$ polynomial regression in one variable, set $\phi_0(x) = 1, \phi_1(x) = x, \phi_2(x) = x^2, ..., \phi_d(x) = x^d$

• Basis functions are fixed for training
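The basis-function construction above can be sketched numerically. A minimal Python example (illustrative only; `poly_features` and the toy data are assumed names, not from the notes) builds the degree-$$d$$ polynomial basis and fits $${\mathbf{w}}$$ by least squares:

```python
import numpy as np

def poly_features(x, d):
    """Basis functions phi_k(x) = x**k for k = 0..d; phi_0 = 1 is the bias."""
    return np.vander(x, d + 1, increasing=True)  # columns: 1, x, x^2, ..., x^d

# Toy data from a known quadratic, plus a little noise
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(0, 0.01, size=x.shape)

# Fit w in h_w(x) = phi(x)^T w by minimizing sum squared error
Phi = poly_features(x, d=2)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
# w recovers approximately [1, 2, -3]
```

Note that the basis functions are fixed before fitting; only $${\mathbf{w}}$$ is learned, which is what makes the model linear in the parameters.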

## Linear Methods for Classification

• Error functions for classification

• Logistic Regression

• Support Vector Machines

## Example: Given nucleus radius, predict cancer recurrence

```r
ggplot(bc, aes(Radius.Mean, fill = Outcome, color = Outcome)) +
  geom_density(alpha = I(1/2))
```

## Example: Solution by linear regression

• Univariate real input: nucleus size
• Output coding: non-recurrence = 0, recurrence = 1
• Sum squared error minimized by the blue line

## Linear regression for classification

• The predictor shows an increasing trend towards recurrence with larger nucleus size, as expected.

• Output cannot be directly interpreted as a class prediction.

• Thresholding output (e.g., at 0.5) could be used to predict 0 or 1.
(In this case, prediction would be 0 except for extremely large nucleus size.)

• Interpret as probability? The output is not bounded to $$[0,1]$$, and even for well-separated data the fitted values are not consistent probability estimates

## Probabilistic view

• Suppose we have two possible classes: $$y\in \{0,1\}$$.

• The symbols “$$0$$” and “$$1$$” are unimportant. Could have been $$\{a,b\}$$, $$\{\mathit{up},\mathit{down}\}$$, whatever.

• Rather than try to predict the class label directly, ask:
What is the probability that a given input $${\mathbf{x}}$$ has class $$y=1$$?

• Conditional Probability:

$P(y=1|{\mathbf{X}}= {\mathbf{x}}) = \frac{P({\mathbf{X}}= {\mathbf{x}}, y=1)}{P({\mathbf{X}}= {\mathbf{x}})}$

and, applying Bayes' rule to the numerator and the law of total probability to the denominator,

$= \frac{P({\mathbf{X}}= {\mathbf{x}}| y=1)P(y=1)}{P({\mathbf{X}}= {\mathbf{x}}|y=1)P(y=1)+P({\mathbf{X}}= {\mathbf{x}}|y=0)P(y=0)}$

## Probabilistic models for binary classification

• Can also write: $P(y=1|{\mathbf{X}}= {\mathbf{x}})=\sigma\left(\log\frac{P(y=1|{\mathbf{X}}= {\mathbf{x}})}{P(y=0|{\mathbf{X}}= {\mathbf{x}})}\right) = \sigma\left(\log\frac{P({\mathbf{X}}= {\mathbf{x}}|y=1)P(y=1)}{P({\mathbf{X}}= {\mathbf{x}}|y=0)P(y=0)}\right)$ where $$\sigma(a) = \frac{1}{1+\exp(-a)}$$, the sigmoid or logistic function.

• Discriminative Learning:
• Model $$\log\frac{P(y=1|{\mathbf{X}}= {\mathbf{x}})}{P(y=0|{\mathbf{X}}= {\mathbf{x}})}$$ (log odds) as a function of $$\mathbf{x}$$

• Only models how to discriminate between examples of the two classes. Does not model distribution of $$\mathbf{x}$$.

• Generative Learning:
• Model $$P(y=1), P(y=0), P({\mathbf{X}}= {\mathbf{x}}|y=1), P({\mathbf{X}}= {\mathbf{x}}|y=0)$$, then use rightmost formula above

• Models the full joint; can actually use the model to generate (i.e. fantasize) data
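As a toy illustration of the generative route (all names and parameter values below are assumed for the sketch: 1-D Gaussian class-conditionals with shared variance, plus class priors), the posterior computed by the rightmost formula above agrees with the sigmoid of the log-odds:

```python
import math

def gauss_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical generative model: P(y), and P(X=x | y) Gaussian per class
p1, p0 = 0.3, 0.7                    # priors P(y=1), P(y=0)
mu1, mu0, sigma = 20.0, 12.0, 3.0    # class-conditional parameters

def posterior(x):
    """P(y=1 | X=x) via the rightmost formula above."""
    num = gauss_pdf(x, mu1, sigma) * p1
    den = num + gauss_pdf(x, mu0, sigma) * p0
    return num / den

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# Check: posterior equals sigma(log-odds) at an arbitrary input
x = 15.0
log_odds = math.log(gauss_pdf(x, mu1, sigma) * p1) - math.log(gauss_pdf(x, mu0, sigma) * p0)
assert abs(posterior(x) - sigmoid(log_odds)) < 1e-12
```

With equal class variances the log-odds here are linear in $$x$$, which is one way generative modeling motivates the logistic-regression form in the next section.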

## Logistic regression HTF (Ch. 4.4)

• Represent the hypothesis as a logistic function of a linear combination of inputs: $h({\mathbf{x}}) = \sigma({\mathbf{x}}^{\mathsf{T}}{\mathbf{w}})$

• Interpret $$h({\mathbf{x}})$$ as $$P(y=1|{\mathbf{X}}= {\mathbf{x}})$$, interpret $${\mathbf{x}}^{\mathsf{T}}{\mathbf{w}}$$ as the log-odds

• How do we choose $${\bf w}$$?

• In the probabilistic framework, observing $$\langle {\mathbf{x}}_i , 1 \rangle$$ does not mean $$h({\mathbf{x}}_i)$$ should be as close to $$1$$ as possible.

• Maximize probability the model assigns to the $$y_i$$, given the $${\mathbf{x}}_i$$.

## Max Conditional Likelihood

• Maximize probability the model assigns to the $$y_i$$, given the $${\mathbf{x}}_i$$.

• Assumption 1: Examples are i.i.d. Probability of observing all $$y$$s is product $\begin{gathered} P(Y_1=y_1, Y_2=y_2, ..., Y_n = y_n|X_1 = {\mathbf{x}}_1, X_2 = {\mathbf{x}}_2, ..., X_n = {\mathbf{x}}_n) \\ = \prod_{i=1}^n P(Y_i = y_i | X_i = {\mathbf{x}}_i)\end{gathered}$

• Assumption 2: \begin{aligned} P(y = 1|{\mathbf{X}}= {\mathbf{x}}) & = h_{\mathbf{w}}({\mathbf{x}}) = 1 / (1 + \exp(-{\mathbf{x}}^{\mathsf{T}}{\mathbf{w}}))\\ P(y = 0|{\mathbf{X}}= {\mathbf{x}}) & = (1 - h_{\mathbf{w}}({\mathbf{x}}))\\\end{aligned}

## Max Conditional Likelihood

• Maximize probability the model assigns to the $$y_i$$, given the $${\mathbf{x}}_i$$.

• More stable to maximize log probability. Note

\begin{aligned} \log P(Y_i = y_i | X_i = {\mathbf{x}}_i) & = \left\{ \begin{array}{ll} \log h_{\mathbf{w}}({\mathbf{x}}_i) & \mbox{if}~y_i=1 \\ \log(1-h_{\mathbf{w}}({\mathbf{x}}_i)) & \mbox{if}~y_i=0 \end{array} \right. \end{aligned}

• Therefore,

$\log \prod_{i=1}^n P(Y_i = y_i | X_i = {\mathbf{x}}_i) = \sum_{i = 1}^n \left[y_i \log( h_{\mathbf{w}}({\mathbf{x}}_i)) + (1 - y_i) \log (1 - h_{\mathbf{w}}({\mathbf{x}}_i))\right]$

• Suggests an error \begin{aligned} \hspace{-2em} J(h_{{\mathbf{w}}}) = - \sum_{i = 1}^n \left[y_i \log( h_{\mathbf{w}}({\mathbf{x}}_i)) + (1 - y_i) \log (1 - h_{\mathbf{w}}({\mathbf{x}}_i))\right]\end{aligned}

• This is the cross entropy: with base-2 logarithms, the number of bits needed to transmit $$y_i$$
if both parties know $$h_{\mathbf{w}}$$ and $${\mathbf{x}}_i$$.
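The error $$J(h_{\mathbf{w}})$$ above has the standard gradient $$\sum_i (h_{\mathbf{w}}({\mathbf{x}}_i) - y_i)\,{\mathbf{x}}_i$$, which can be verified numerically. A minimal Python sketch with made-up data (not from the notes):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy(w, X, y):
    """J(h_w) = -sum_i [ y_i log h_w(x_i) + (1 - y_i) log(1 - h_w(x_i)) ]."""
    h = sigmoid(X @ w)
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

def grad(w, X, y):
    """Analytic gradient: sum_i (h_w(x_i) - y_i) x_i."""
    return X.T @ (sigmoid(X @ w) - y)

# Illustrative data: 6 examples, bias column plus 2 features
rng = np.random.default_rng(1)
X = np.hstack([np.ones((6, 1)), rng.normal(size=(6, 2))])
y = np.array([0, 0, 1, 1, 0, 1], dtype=float)
w = np.array([0.1, -0.2, 0.3])

# Central finite differences agree with the analytic gradient
eps = 1e-6
fd = np.array([(cross_entropy(w + eps * e, X, y) - cross_entropy(w - eps * e, X, y)) / (2 * eps)
               for e in np.eye(3)])
assert np.allclose(grad(w, X, y), fd, atol=1e-5)
```

There is no closed-form minimizer, so in practice $${\mathbf{w}}$$ is found by iterative methods (e.g., gradient descent or Newton steps) using exactly this gradient.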

## Back to the breast cancer problem

Logistic Regression:

```
## (Intercept) Radius.Mean
##  -3.4671348   0.1296493
```

Least Squares:

```
## (Intercept) Radius.Mean
## -0.17166939  0.02349159
```
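Given the fitted logistic-regression coefficients above, the predicted recurrence probability crosses $$0.5$$ exactly where the log-odds $$w_0 + w_1 x$$ are zero. A quick Python check (`p_recur` is an illustrative name):

```python
import math

# Fitted logistic-regression coefficients from the output above
w0, w1 = -3.4671348, 0.1296493   # intercept, Radius.Mean

def p_recur(x):
    """Predicted P(recurrence | Radius.Mean = x)."""
    return 1.0 / (1.0 + math.exp(-(w0 + w1 * x)))

# P = 0.5 exactly where w0 + w1 * x = 0
threshold = -w0 / w1
# threshold is roughly 26.7: the mean radius at which recurrence
# becomes more likely than not under this model
```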

## Supervised Learning Methods: “Objective-driven”

| Method | Form | Objective |
|--------|------|-----------|
| OLS | $$h_{\mathbf{w}}({\mathbf{x}}) = {\mathbf{x}}^{\mathsf{T}}{\mathbf{w}} \approx E[Y \mid \mathbf{X}={\mathbf{x}}]$$, using a linear function | $$\sum_{i=1}^n (h_{\mathbf{w}}({\mathbf{x}}_i) - y_i)^2$$ |
| LR | $$h_{\mathbf{w}}({\mathbf{x}}) = \frac{1}{1 + \mathrm{e}^{-{\mathbf{x}}^{\mathsf{T}}{\mathbf{w}}}} \approx P(Y=1 \mid \mathbf{X}={\mathbf{x}})$$, using sigmoid of a linear function | $$-\sum_{i=1}^n \left[y_i \log h_{\mathbf{w}}({\mathbf{x}}_i) + (1-y_i) \log (1-h_{\mathbf{w}}({\mathbf{x}}_i))\right]$$ |

• Both model the conditional mean of $$y$$ using a (transformed) linear function
• Both use maximum conditional likelihood to estimate $${\mathbf{w}}$$

## Decision boundary HTF Ch. 2.3.1,2.3.2

• How complicated is a classifier?

• One way to think about it is in terms of its decision boundary, i.e., the surface it draws to separate examples of the different classes

• Linear classifiers draw a hyperplane between examples of the different classes. Non-linear classifiers draw more complicated surfaces between the different classes.

• For a probabilistic classifier with a cutoff of 0.5,
the decision boundary is the curve on which: $\frac{P(y=1|{\mathbf{X}}= {\mathbf{x}})}{P(y=0|{\mathbf{X}}= {\mathbf{x}})} = 1, \mbox{i.e., where } \log\frac{P(y=1|{\mathbf{X}}= {\mathbf{x}})}{P(y=0|{\mathbf{X}}= {\mathbf{x}})} = 0$
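For logistic regression, a cutoff other than $$0.5$$ simply shifts the boundary in log-odds space: $$\sigma({\mathbf{x}}^{\mathsf{T}}{\mathbf{w}}) > t$$ is equivalent to $${\mathbf{x}}^{\mathsf{T}}{\mathbf{w}} > \log\frac{t}{1-t}$$. A minimal Python check (`log_odds_cutoff` is an illustrative name):

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def log_odds_cutoff(t):
    """sigma(x^T w) > t  is equivalent to  x^T w > log(t / (1 - t))."""
    return math.log(t / (1 - t))

# Cutoff 0.5 puts the boundary at x^T w = 0; cutoff 0.25 moves it to
# x^T w = log(1/3), enlarging the region predicted positive
assert log_odds_cutoff(0.5) == 0.0
assert abs(sigmoid(log_odds_cutoff(0.25)) - 0.25) < 1e-12
```

This is why lowering the cutoff from $$0.5$$ to $$0.25$$, as in the next slides, classifies more inputs as recurrence: the linear boundary shifts but remains linear.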

## Decision boundary

Class = R if $${\mathrm{Pr}}(Y=1|X=x) > 0.5$$

## Decision boundary

Class = R if $${\mathrm{Pr}}(Y=1|X=x) > 0.25$$