2017-10-03

Linear models in general (HTF Ch. 2.8.3)

  • By linear models, we mean that the hypothesis function \(h_{\bf w}({\bf x})\) is a (transformed) linear function of the parameters \({\bf w}\).

  • Predictions are a (transformed) linear combination of feature values

\[h_{\bf w}({\mathbf{x}}) = g\left(\sum_{k=0}^{p} w_k \phi_k({\mathbf{x}})\right) = g({{\boldsymbol{\phi}}}({\mathbf{x}})^{\mathsf{T}}{{\mathbf{w}}})\]

  • where \(\phi_k\) are called basis functions. As usual, we let \(\phi_0({\mathbf{x}})=1, \forall {\mathbf{x}}\), so that we don't force \(h_{\bf w}(0) = 0\)

  • Polynomial regression: set \(\phi_0(x) = 1, \phi_1(x) = x, \phi_2(x) = x^2, ..., \phi_d(x) = x^d\) and set \(g(x) = x\) (a short R sketch appears after this list).

  • Basis functions are fixed for training (but can be chosen through model selection)
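
A minimal R sketch of the polynomial-regression setup above, on simulated data (the data and the degree are illustrative assumptions; lm() supplies the constant basis function \(\phi_0\) through its intercept):

# polynomial regression as a linear model in the basis 1, x, x^2, x^3
set.seed(1)
x <- runif(50, -2, 2)
y <- sin(x) + rnorm(50, sd = 0.1)                # toy data, assumed for illustration
fit <- lm(y ~ poly(x, degree = 3, raw = TRUE))   # phi_k(x) = x^k, g = identity
coef(fit)                                        # the learned weights w_0, ..., w_3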

Linear Methods for Classification

  • Classification tasks

  • Loss functions for classification

  • Logistic Regression

  • Support Vector Machines

Example: Given nucleus radius, predict cancer recurrence

# density of mean nucleus radius, by outcome (bc is the breast-cancer data frame)
library(ggplot2)
ggplot(bc, aes(Radius.Mean, fill = Outcome, color = Outcome)) + geom_density(alpha = 1/2)

Example: Solution by linear regression

  • Univariate real input: nucleus size
  • Output coding: non-recurrence = 0, recurrence = 1
  • Sum squared error minimized by the blue line

Linear regression for classification

  • The predictor shows an increasing trend towards recurrence with larger nucleus size, as expected.

  • Output cannot be directly interpreted as a class prediction.

  • Thresholding output (e.g., at 0.5) could be used to predict 0 or 1.
    (In this case, prediction would be 0 except for extremely large nucleus size.)
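
A sketch of this thresholding rule in R; lm_fit and new_radii are hypothetical names for a fitted linear model on the 0/1-coded outcome and a vector of new nucleus radii:

# hypothetical sketch: turn linear-regression output into a 0/1 class prediction
scores <- predict(lm_fit, newdata = data.frame(Radius.Mean = new_radii))
pred <- ifelse(scores > 0.5, 1, 0)   # threshold at 0.5; 1 = recurrence, 0 = non-recurrence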

Probabilistic view

  • Suppose we have two possible classes: \(y\in \{0,1\}\).

  • The symbols “\(0\)” and “\(1\)” are unimportant. Could have been \(\{a,b\}\), \(\{\mathit{up},\mathit{down}\}\), whatever.

  • Rather than try to predict the class label directly, ask:
    What is the probability that a given input \({\mathbf{x}}\) has class \(y=1\)?

  • Conditional Probability:

\[P(y=1|{\mathbf{X}}= {\mathbf{x}}) = \frac{P({\mathbf{X}}= {\mathbf{x}}, y=1)}{P({\mathbf{X}}= {\mathbf{x}})} \]

Probabilistic models for binary classification

What kind of function do we use for \(P(y=1|{\mathbf{X}}= {\mathbf{x}})\)?

Idea: \(h_{\mathbf{w}}({\mathbf{x}}) = {\mathbf{w}}^{\mathsf{T}}{\mathbf{x}}\)

Why? Why not?

Sigmoid function

\[\sigma(x) = \frac{1}{1+e^{-x}}\]
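
In R, the sigmoid is a one-liner; a quick check of its limiting behaviour:

sigmoid <- function(a) 1 / (1 + exp(-a))
sigmoid(c(-10, 0, 10))   # approximately 0.0000454, 0.5, 0.9999546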

Logistic Regression (HTF Ch. 4.4)

  • Represent the hypothesis as a logistic function of a linear combination of inputs, interpret \(h({\mathbf{x}})\) as \(P(y=1|{\mathbf{X}}= {\mathbf{x}})\): \[h_{\mathbf{w}}({\mathbf{x}}) = \sigma({\mathbf{x}}^{\mathsf{T}}{\mathbf{w}})\]

  • \(\sigma(a) = \frac{1}{1+\exp(-a)}\) is the sigmoid or logistic function

  • With a little algebra, we can write: \[P(y=1|{\mathbf{X}}= {\mathbf{x}})=\sigma\left(\log\frac{P(y=1|{\mathbf{X}}= {\mathbf{x}})}{P(y=0|{\mathbf{X}}= {\mathbf{x}})}\right)\]

    • Interpret \({\mathbf{x}}^{\mathsf{T}}{\mathbf{w}}\) as the log-odds
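
A quick numerical check of this identity in R (p is an arbitrary probability chosen for illustration):

p <- 0.8
a <- log(p / (1 - p))   # the log-odds of p
1 / (1 + exp(-a))       # sigma(a) recovers p = 0.8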

Logistic regression training (HTF Ch. 4.4)

  • How do we choose \({\bf w}\)?

  • In the probabilistic framework, observing \(\langle {\mathbf{x}}_i , 1 \rangle\) does not mean \(h_{\mathbf{w}}({\mathbf{x}}_i)\) should be as close to \(1\) as possible.

  • Maximize probability the model assigns to the \(y_i\), given the \({\mathbf{x}}_i\) by adjusting \({\mathbf{w}}\).

Reminder: Independence

  • Two random variables \(X\) and \(Y\) that are part of a random vector are independent iff:

\[ F_{X,Y}(x,y) = F_X(x)F_Y(y) \]

If they have a joint density or joint PMF, then

\[ f_{X,Y}(x,y) = f_X(x)f_Y(y) \]

Max Conditional Likelihood

  • Maximize probability the model assigns to the \(y_i\), given the \({\mathbf{x}}_i\) by adjusting \({\mathbf{w}}\).

  • Assumption 1: Examples are i.i.d., so the probability of observing all the \(y_i\) (given all the \({\mathbf{x}}_i\)) is a product:

    \(\begin{aligned} P(\mathrm{all~y}|\mathrm{all~x}) & = P(Y_1=y_1, Y_2=y_2, ..., Y_n = y_n|X_1 = {\mathbf{x}}_1, X_2 = {\mathbf{x}}_2, ..., X_n = {\mathbf{x}}_n)\\ & = \prod_{i=1}^n P(Y_i = y_i | X_1 = {\mathbf{x}}_1, X_2 = {\mathbf{x}}_2, ..., X_n = {\mathbf{x}}_n)\\ & = \prod_{i=1}^n P(Y_i = y_i | X_i = {\mathbf{x}}_i)\end{aligned}\)

  • Assumption 2: \(\begin{aligned} P(y = 1|{\mathbf{X}}= {\mathbf{x}}) & = h_{\mathbf{w}}({\mathbf{x}}) = 1 / (1 + \exp(-{\mathbf{x}}^{\mathsf{T}}{\mathbf{w}}))\\ P(y = 0|{\mathbf{X}}= {\mathbf{x}}) & = (1 - h_{\mathbf{w}}({\mathbf{x}}))\\\end{aligned}\)

Max Conditional Likelihood

  • Maximize probability the model assigns to the \(y_i\), given the \({\mathbf{x}}_i\) by adjusting \({\mathbf{w}}\).

  • It is numerically more stable to maximize the log probability. Note that

\[\begin{aligned} \log P(Y_i = y_i | X_i = {\mathbf{x}}_i) & = \left\{ \begin{array}{ll} \log h_{\mathbf{w}}({\mathbf{x}}_i) & \mbox{if}~y_i=1 \\ \log(1-h_{\mathbf{w}}({\mathbf{x}}_i)) & \mbox{if}~y_i=0 \end{array} \right. \end{aligned} \]

  • Therefore,

\[\log \prod_{i=1}^n P(Y_i = y_i | X_i = {\mathbf{x}}_i) = \sum_{i = 1}^n \left[y_i \log( h_{\mathbf{w}}({\mathbf{x}}_i)) + (1 - y_i) \log (1 - h_{\mathbf{w}}({\mathbf{x}}_i))\right] \]

  • Suggests an error \[J(h_{{\mathbf{w}}}) = - \sum_{i = 1}^n \left[y_i \log( h_{\mathbf{w}}({\mathbf{x}}_i)) + (1 - y_i) \log (1 - h_{\mathbf{w}}({\mathbf{x}}_i))\right]\]

  • This is the cross-entropy: the number of bits needed to transmit \(y_i\)
    if both parties know \(h_{\mathbf{w}}\) and \({\mathbf{x}}_i\).
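
A sketch of this objective in R, fitted by generic numerical minimization on simulated data (the data are assumed for illustration; glm() with family = binomial does the same job with a specialized optimizer):

# negative log-likelihood (cross-entropy) for logistic regression
# X: n x (p+1) matrix with a constant column, y: 0/1 labels, w: weight vector
neg_log_lik <- function(w, X, y) {
  h <- 1 / (1 + exp(-X %*% w))
  -sum(y * log(h) + (1 - y) * log(1 - h))
}

set.seed(1)
x <- rnorm(200)
y <- rbinom(200, 1, 1 / (1 + exp(-(1 + 2 * x))))   # simulated labels
X <- cbind(1, x)

optim(c(0, 0), neg_log_lik, X = X, y = y, method = "BFGS")$par   # compare with glm(y ~ x, family = binomial)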

Back to the breast cancer problem

Logistic Regression:

## (Intercept) Radius.Mean 
##  -3.4671348   0.1296493

Least Squares:

## (Intercept) Radius.Mean 
## -0.17166939  0.02349159
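
Calls along these lines would produce coefficient tables like the ones above (bc and Radius.Mean are from the example; the assumption that Outcome has a level "R" for recurrence is mine):

bc$Recur <- as.numeric(bc$Outcome == "R")   # 1 = recurrence, 0 = non-recurrence (coding assumed)
coef(glm(Recur ~ Radius.Mean, data = bc, family = binomial))   # logistic regression
coef(lm(Recur ~ Radius.Mean, data = bc))                       # least squares on the 0/1 labels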

Probability and Expectation

  • Why are the two fitted prediction functions so close over the range of the data?
  • The expected value of a discrete random variable \(Y\) is defined as

\[{\mathrm{E}}[Y] = \sum_{y \in \mathcal{Y}} y \cdot p_Y(Y = y)\]

  • Consider a random variable \(Y \in \{0,1\}\)

\(\begin{aligned} {\mathrm{E}}[Y] & = \sum_{y \in \{0,1\}} y \cdot p_Y(Y = y)\\ & = 0 \cdot p_Y(Y = 0) + 1 \cdot p_Y(Y = 1)\\ & = p_Y(Y = 1) \end{aligned}\)

  • Though we did not discuss it in this way, linear regression tries to estimate \({\mathrm{E}}[Y | X = x]\). So it makes sense that the OLS and logistic regression answers can be close.

Supervised Learning Methods: “Objective-driven”

  • OLS: \(h_{\mathbf{w}}({\mathbf{x}}) = {\mathbf{x}}^{\mathsf{T}}{\mathbf{w}} \approx {\mathrm{E}}[Y|\mathbf{X}={\mathbf{x}}]\), a linear function; objective: \(\sum_{i=1}^n (h_{\mathbf{w}}({\mathbf{x}}_i) - y_i)^2\)
  • LR: \(h_{\mathbf{w}}({\mathbf{x}}) = \frac{1}{1 + \mathrm{e}^{-{\mathbf{x}}^{\mathsf{T}}{\mathbf{w}}}} \approx P(Y=1|\mathbf{X}={\mathbf{x}})\), a sigmoid of a linear function; objective: \(-\sum_{i=1}^n \left[y_i \log h_{\mathbf{w}}({\mathbf{x}}_i) + (1-y_i) \log (1-h_{\mathbf{w}}({\mathbf{x}}_i))\right]\)
  • Both model the conditional mean of \(y\) using a (transformed) linear function
  • Both use maximum conditional likelihood to estimate \({\mathbf{w}}\) (for OLS, this corresponds to a Gaussian noise model)

Decision boundary (HTF Ch. 2.3.1, 2.3.2)

  • How complicated is a classifier?

  • One way to think about it is in terms of its decision boundary, i.e., the surface it defines for separating examples of the different classes

  • Linear classifiers draw a hyperplane between examples of the different classes. Non-linear classifiers draw more complicated surfaces between the different classes.

  • For a probabilistic classifier with a cutoff of 0.5,
    the decision boundary is the curve on which: \[\frac{P(y=1|{\mathbf{X}}= {\mathbf{x}})}{P(y=0|{\mathbf{X}}= {\mathbf{x}})} = 1, \mbox{i.e., where } \log\frac{P(y=1|{\mathbf{X}}= {\mathbf{x}})}{P(y=0|{\mathbf{X}}= {\mathbf{x}})} = 0\]

  • For logistic regression, this is where \({\mathbf{x}}^{\mathsf{T}}{\mathbf{w}}= 0\).

Decision boundaries of linear classifiers

  • Recall: predictions are a (transformed) linear combination of feature values

\[h_{\bf w}({\mathbf{x}}) = g({\mathbf{x}}^{\mathsf{T}}{{\mathbf{w}}})\]

  • Suppose our decision boundary is \[h_{\bf w}({\mathbf{x}}) = c\].

  • This is equivalent to \[{\mathbf{x}}^{\mathsf{T}}{{\mathbf{w}}} = c'\]

where \(c' = g^{-1}(c)\) (assuming \(g\) is invertible, as the sigmoid is).
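
For logistic regression, \(g\) is the sigmoid, so \(g^{-1}(c) = \log\frac{c}{1-c}\). A sketch in R of where the boundary falls on the radius axis for different cutoffs, using the univariate coefficients reported earlier (the closed-form solve is an illustration, not part of the original slides):

w0 <- -3.4671348; w1 <- 0.1296493   # intercept and slope of the logistic fit shown earlier
boundary <- function(cutoff) (log(cutoff / (1 - cutoff)) - w0) / w1   # solve sigma(w0 + w1 * x) = cutoff for x
boundary(0.5)    # about 26.7: predict recurrence only for very large radii
boundary(0.25)   # about 18.3: lowering the cutoff moves the boundary to smaller radii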

Decision boundary

Class = R (recurrence) if \({\mathrm{Pr}}(Y=1|X=x) > 0.5\)

Decision boundary

Class = R (recurrence) if \({\mathrm{Pr}}(Y=1|X=x) > 0.25\)