2016-02-09

Linear models in general (HTF Ch. 2.8.3)

  • By linear models, we mean that the hypothesis function \(h_{\bf w}({\bf x})\) is a linear function of the parameters \({\bf w}\).

  • Predictions are a linear combination of feature values

  • \[h_{\bf w}({\mathbf{x}}) = \sum_{k=0}^{p} w_k \phi_k({\mathbf{x}}) = {{\boldsymbol{\phi}}}({\mathbf{x}})^{\mathsf{T}}{{\mathbf{w}}}\] where the \(\phi_k\) are called basis functions (or features!). As usual, we let \(\phi_0({\mathbf{x}})=1, \forall {\mathbf{x}}\), to create a bias (intercept) term.

  • To recover degree-\(d\) polynomial regression in one variable, set \[\phi_0(x) = 1, \phi_1(x) = x, \phi_2(x) = x^2, ..., \phi_d(x) = x^d\]

  • Basis functions are fixed for a given analysis
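
As a minimal sketch (the data frame df with columns x and y is hypothetical), degree-\(d\) polynomial regression is just an ordinary linear least-squares problem once the basis \(\phi_k(x) = x^k\) has been computed:

# Build the polynomial basis phi_k(x) = x^k, k = 0..d, then solve ordinary least squares.
# df is a hypothetical data frame with numeric input x and output y.
d   <- 3
Phi <- sapply(0:d, function(k) df$x^k)           # n x (d+1) basis matrix; first column is all 1s
w   <- solve(t(Phi) %*% Phi, t(Phi) %*% df$y)    # least-squares weights: the model is linear in w
# Equivalently, via lm() with the raw polynomial basis:
fit <- lm(y ~ poly(x, degree = d, raw = TRUE), data = df)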

Linear Methods for Classification

  • Classification tasks

  • Error functions for classification

  • Logistic Regression

  • Generalized Linear Models

  • Support Vector Machines

Example: Given nucleus radius, predict cancer recurrence

library(ggplot2)   # bc: breast-cancer data with columns Radius.Mean and Outcome
ggplot(bc, aes(Radius.Mean, fill = Outcome, color = Outcome)) + geom_density(alpha = I(1/2))

Example: Solution by linear regression

  • Univariate real input: nucleus size
  • Output coding: non-recurrence = 0, recurrence = 1
  • Sum squared error minimized by the blue line
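
A hedged sketch of this fit, assuming bc is the breast-cancer data frame from the density plot above; the column names and the recurrence label are assumptions:

# Least-squares "classification": regress a 0/1-coded outcome on nucleus radius.
# bc, Radius.Mean, Outcome, and the label "recurrence" are assumed from the plot above.
bc$y01 <- as.numeric(bc$Outcome == "recurrence")   # recurrence = 1, non-recurrence = 0
ls_fit <- lm(y01 ~ Radius.Mean, data = bc)         # minimizes the sum of squared errors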

Linear regression for classification

  • The predictor shows an increasing trend towards recurrence with larger nucleus size, as expected.

  • Output cannot be directly interpreted as a class prediction.

  • Thresholding output (e.g., at 0.5) could be used to predict 0 or 1.
    (In this case, prediction would be 0 except for extremely large nucleus size.)

  • Interpret as probability? Not bounded to \([0,1]\), not consistent even for well-separated data
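
Continuing the hypothetical ls_fit from the sketch above, thresholding its output at 0.5:

# Hard 0/1 predictions by thresholding the linear-regression output at 0.5.
pred01 <- as.numeric(fitted(ls_fit) >= 0.5)
# The raw output fitted(ls_fit) is a real number, not restricted to [0, 1].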

Probabilistic view

  • Suppose we have two possible classes: \(y\in \{0,1\}\).

  • The symbols “\(0\)” and “\(1\)” are unimportant. Could have been \(\{a,b\}\), \(\{\mathit{up},\mathit{down}\}\), whatever. We’ll use \(y\in \{0,1\}\) though.

  • Rather than try to predict the class label directly, ask:
    What is the probability that a given input \({\mathbf{x}}\) has class \(y=1\)?

  • Bayes Rule:

\[P(y=1|{\mathbf{x}}) = \frac{P({\mathbf{x}}, y=1)}{P({\mathbf{x}})} = \frac{P({\mathbf{x}}| y=1)P(y=1)}{P({\mathbf{x}}|y=1)P(y=1)+P({\mathbf{x}}|y=0)P(y=0)} \]
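
A small numeric sketch of this rule, with made-up Gaussian class-conditional densities and prior:

# Posterior P(y = 1 | x) via Bayes' rule; the densities and prior are made up for illustration.
posterior1 <- function(x, prior1 = 0.3) {
  num <- dnorm(x, mean = 22, sd = 4) * prior1              # P(x | y = 1) P(y = 1)
  den <- num + dnorm(x, mean = 17, sd = 3) * (1 - prior1)  # + P(x | y = 0) P(y = 0)
  num / den
}
posterior1(20)   # probability of class 1 at x = 20 (illustrative numbers only)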

Probabilistic models for binary classification

  • Can also write: \[P(y=1|{\mathbf{x}})=\sigma\left(\log\frac{P(y=1|{\mathbf{x}})}{P(y=0|{\bf x})}\right) = \sigma\left(\log\frac{P({\mathbf{x}}|y=1)P(y=1)}{P({\mathbf{x}}|y=0)P(y=0)}\right)\] where \(\sigma(a) = \frac{1}{1+\exp(-a)}\), the sigmoid or logistic function.

  • Discriminative Learning:
    • Model \(\log\frac{P(y=1|{\mathbf{x}})}{P(y=0|{\mathbf{x}})}\) (the log-odds ratio) as a function of \(\mathbf{x}\)

    • Only models how to discriminate between examples of the two classes. Does not model distribution of \(\mathbf{x}\).

  • Generative Learning:
    • Model \(P(y=1), P(y=0), P({\mathbf{x}}|y=1), P({\mathbf{x}}|y=0)\), then use rightmost formula above

    • Models the full joint; can actually use the model to generate (i.e. fantasize) data
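
The identity \(P(y=1|{\mathbf{x}}) = \sigma(\text{log-odds})\) from the first bullet above can be checked numerically; a minimal sketch:

# Numerical check: the sigmoid of the log-odds recovers the posterior probability.
sigmoid <- function(a) 1 / (1 + exp(-a))
p1 <- 0.8                      # arbitrary posterior P(y = 1 | x)
sigmoid(log(p1 / (1 - p1)))    # returns 0.8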

Logistic regression HTF (Ch. 4.4)

  • Represent the hypothesis as a logistic function of a linear combination of inputs: \[h({\mathbf{x}}) = \sigma({\mathbf{x}}^{\mathsf{T}}{\mathbf{w}})\]

  • Interpret \(h({\mathbf{x}})\) as \(P(y=1|{\mathbf{x}})\), interpret \({\mathbf{x}}^{\mathsf{T}}{\mathbf{w}}\) as the log-odds ratio.

  • How do we choose \({\bf w}\)?

  • In the probabilistic framework, observing \(\langle {\mathbf{x}}_i , 1 \rangle\) (resp. \(\langle {\mathbf{x}}_i , 0 \rangle\)) does not mean \(h({\mathbf{x}}_i)\) should be \(1\) (resp. \(0\))

  • Maximize probability of having observed the \(y_i\), given the \({\mathbf{x}}_i\).
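
A minimal sketch of the hypothesis itself, with placeholder (not fitted) weights:

# h_w(x) = sigmoid(x^T w); the input vector includes the bias feature x_0 = 1.
h <- function(x, w) 1 / (1 + exp(-sum(x * w)))
h(c(1, 20), c(-3, 0.1))   # interpreted as P(y = 1 | x); the weights are placeholders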

Max Conditional Likelihood

  • Maximize probability of having observed the \(y_i\), given the \({\mathbf{x}}_i\).

  • Assumption 1: Examples are i.i.d. Probability of observing all \(y\)s is product \[\begin{gathered} P(Y_1=y_1, Y_2=y_2, ..., Y_n = y_n|X_1 = {\mathbf{x}}_1, X_2 = {\mathbf{x}}_2, ..., X_n = {\mathbf{x}}_n) \\ = \prod_{i=1}^n P(Y_i = y_i | X_i = {\mathbf{x}}_i)\end{gathered}\]

  • Assumption 2: \[\begin{aligned} P(y = 1|{\mathbf{x}}) & = h_{\mathbf{w}}({\mathbf{x}}) = \sigma({\mathbf{x}}^{\mathsf{T}}{\mathbf{w}}) = 1 / (1 + \exp(-{\mathbf{x}}^{\mathsf{T}}{\mathbf{w}}))\\ P(y = 0|{\mathbf{x}}) & = (1 - \sigma({\mathbf{x}}^{\mathsf{T}}{\mathbf{w}})) = \exp(-{\mathbf{x}}^{\mathsf{T}}{\mathbf{w}}) / (1 + \exp(-{\mathbf{x}}^{\mathsf{T}}{\mathbf{w}}))\\\end{aligned}\]

  • The product of probabilities will underflow numerically; work with the log probability instead. Therefore \[\begin{aligned} \hspace{-2em} \log \prod_{i=1}^n P(Y_i = y_i | X_i = {\mathbf{x}}_i) & = \sum_{i = 1}^n \left[y_i \log( h_{\mathbf{w}}({\mathbf{x}}_i)) + (1 - y_i) \log (1 - h_{\mathbf{w}}({\mathbf{x}}_i))\right]\end{aligned}\]
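
A sketch of evaluating this conditional log-likelihood for a candidate \({\mathbf{w}}\); X, y, and w are hypothetical, with X an \(n \times (p+1)\) design matrix whose first column is all 1s:

# Conditional log-likelihood of logistic regression for a given weight vector w.
log_lik <- function(w, X, y) {
  p <- 1 / (1 + exp(-X %*% w))            # h_w(x_i) for every row of X
  sum(y * log(p) + (1 - y) * log(1 - p))  # sum of per-example log probabilities
}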

Min Cross-Entropy

  • Maximize probability of having observed the \(y_i\), given the \({\mathbf{x}}_i\).

  • More stable to maximize log probability. Note

\[\begin{aligned} \log P(Y_i = y_i | X_i = {\mathbf{x}}_i) & = \left\{ \begin{array}{ll} \log h_{\mathbf{w}}({\mathbf{x}}_i) & \mbox{if}~y_i=1 \\ \log(1-h_{\mathbf{w}}({\mathbf{x}}_i)) & \mbox{if}~y_i=0 \end{array} \right. \end{aligned} \]

  • Therefore,

\[\log \prod_{i=1}^n P(Y_i = y_i | X_i = {\mathbf{x}}_i) = \sum_{i = 1}^n \left[y_i \log( h_{\mathbf{w}}({\mathbf{x}}_i)) + (1 - y_i) \log (1 - h_{\mathbf{w}}({\mathbf{x}}_i))\right] \]

  • This suggests minimizing the error function \[\begin{aligned} \hspace{-2em} J(h_{{\mathbf{w}}}) = - \sum_{i = 1}^n \left[y_i \log( h_{\mathbf{w}}({\mathbf{x}}_i)) + (1 - y_i) \log (1 - h_{\mathbf{w}}({\mathbf{x}}_i))\right]\end{aligned}\]

  • This is the cross-entropy: the number of bits needed to transmit \(y_i\)
    if both parties already know \(h_{\mathbf{w}}\) and \({\mathbf{x}}_i\).
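
A hedged sketch of choosing \({\mathbf{w}}\) by minimizing this cross-entropy numerically with a generic optimizer (X and y as in the previous sketch; glm() instead fits it by Fisher scoring, i.e. iteratively reweighted least squares):

# Fit logistic-regression weights by minimizing the cross-entropy.
cross_entropy <- function(w, X, y) {
  p <- 1 / (1 + exp(-X %*% w))
  -sum(y * log(p) + (1 - y) * log(1 - p))
}
# X: n x (p+1) design matrix with a column of 1s; y: 0/1 outcomes (both hypothetical here).
w_hat <- optim(par = rep(0, ncol(X)), fn = cross_entropy, X = X, y = y, method = "BFGS")$par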

Back to the breast cancer problem

Logistic Regression:

## (Intercept) Radius.Mean 
##  -3.4671348   0.1296493

Least Squares:

## (Intercept) Radius.Mean 
## -0.17166939  0.02349159
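
A hedged sketch of the calls behind these numbers, reusing the assumed bc data frame and 0/1 column y01 from the earlier least-squares sketch:

# Logistic regression vs. least squares on the same 0/1-coded outcome.
lr_fit <- glm(y01 ~ Radius.Mean, data = bc, family = binomial)   # logit link, max conditional likelihood
coef(lr_fit)                             # logistic-regression coefficients
coef(lm(y01 ~ Radius.Mean, data = bc))   # least-squares coefficients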

Supervised Learning Methods: “Objective-driven”

  • OLS
    • Form: \(h_{\mathbf{w}}({\mathbf{x}}) = {\mathbf{x}}^{\mathsf{T}}{\mathbf{w}}\approx E[Y|\mathbf{X}={\mathbf{x}}]\), using a linear function
    • Objective: \(\sum_{i=1}^n (h_{\mathbf{w}}({\mathbf{x}}_i) - y_i)^2\)
  • LR
    • Form: \(h_{\mathbf{w}}({\mathbf{x}}) = \frac{1}{1 + \mathrm{e}^{-{\mathbf{x}}^{\mathsf{T}}{\mathbf{w}}}}\approx P(Y=1|\mathbf{X}={\mathbf{x}})\), using the sigmoid of a linear function
    • Objective: \(-\sum_{i=1}^n \left[y_i \log h_{\mathbf{w}}({\mathbf{x}}_i) + (1-y_i) \log (1-h_{\mathbf{w}}({\mathbf{x}}_i))\right]\)
  • Both model the conditional mean of \(y\) using a (transformed) linear function
  • Both use maximum conditional likelihood to estimate \({\mathbf{w}}\) (for OLS, this corresponds to assuming Gaussian noise)

Generalized Linear Models

  • Model the conditional mean of \(Y|{\mathbf{X}}={\mathbf{x}}\), denoted \(\hat\mu_{\mathbf{x}}\)
  • Assumption: \(g(\hat\mu_{\mathbf{x}}) = {\mathbf{x}}^{\mathsf{T}}{\mathbf{w}}\)
  • \(g\) is the link function
    • Linear regression: \({\mathbf{x}}^{\mathsf{T}}{\mathbf{w}}= \hat\mu_{\mathbf{x}}\), \(\hat\mu_{\mathbf{x}}= {\mathbf{x}}^{\mathsf{T}}{\mathbf{w}}\)
      • Identity link: \(g(y) = y\)
    • Logistic regression: \({\mathbf{x}}^{\mathsf{T}}{\mathbf{w}}= \ln \frac{\hat\mu_{\mathbf{x}}}{1 - \hat\mu_{\mathbf{x}}}\), \(\hat\mu_{\mathbf{x}}= \frac{1}{1 + \mathrm{e}^{-{\mathbf{x}}^{\mathsf{T}}{\mathbf{w}}}}\)
      • Logit link: \(g(y) = \ln \frac{y}{1 - y}\)
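
In R, each GLM family object carries its link function and inverse link; a short sketch:

# Link functions and their inverses, as carried by R's GLM family objects.
gaussian()$linkfun(0.25)     # identity link: returns 0.25
binomial()$linkfun(0.25)     # logit link: log(0.25 / 0.75)
binomial()$linkinv(-1.0986)  # inverse logit (sigmoid): about 0.25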

Poisson Distribution

  • \(Y \sim \mathrm{Poisson}(\lambda)\): \(P(Y = k) = \frac{\lambda^k \mathrm{e}^{-\lambda}}{k!}\), for \(k = 0, 1, 2, \ldots\)
  • Mean (and variance) equal to \(\lambda\); a standard model for count data

Poisson Regression

  • Assume \(Y|X\) is Poisson
  • \(\hat\lambda_{\mathbf{x}}= \hat\mu_{\mathbf{x}}= \mathrm{e}^{{\mathbf{w}}^{\mathsf{T}}{\mathbf{x}}}\)
  • \({\mathbf{w}}^{\mathsf{T}}{\mathbf{x}}= \ln \hat\lambda_{\mathbf{x}}= \ln\hat\mu_{\mathbf{x}}\)
  • Link function is \(g(y) = \ln y\)

Horseshoe Crabs

##    Satellites         Width       Dark     GoodSpine
##  Min.   : 0.000   Min.   :21.0   no :107   no :121  
##  1st Qu.: 0.000   1st Qu.:24.9   yes: 66   yes: 52  
##  Median : 2.000   Median :26.1                      
##  Mean   : 2.919   Mean   :26.3                      
##  3rd Qu.: 5.000   3rd Qu.:27.7                      
##  Max.   :15.000   Max.   :33.5

Poisson Regression

preg <- glm(data=crabs,formula=Satellites ~ Width * Dark * GoodSpine,family="poisson"); summary(preg)
## 
## Call:
## glm(formula = Satellites ~ Width * Dark * GoodSpine, family = "poisson", 
##     data = crabs)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9448  -1.9738  -0.4940   0.9552   4.6511  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                -3.41436    1.00512  -3.397 0.000681 ***
## Width                       0.17127    0.03656   4.685 2.81e-06 ***
## Darkyes                    -1.04896    1.65607  -0.633 0.526472    
## GoodSpineyes                2.26862    1.32812   1.708 0.087610 .  
## Width:Darkyes               0.02991    0.06200   0.482 0.629544    
## Width:GoodSpineyes         -0.08400    0.04850  -1.732 0.083293 .  
## Darkyes:GoodSpineyes       -7.40779    3.48306  -2.127 0.033436 *  
## Width:Darkyes:GoodSpineyes  0.27509    0.12655   2.174 0.029723 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 632.79  on 172  degrees of freedom
## Residual deviance: 549.49  on 165  degrees of freedom
## AIC: 920.79
## 
## Number of Fisher Scoring iterations: 6
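
A usage sketch for the fitted model: with type = "response", predict() returns the estimated Poisson mean \(\mathrm{e}^{{\mathbf{w}}^{\mathsf{T}}{\mathbf{x}}}\) (the new-data values below are made up):

# Predicted mean satellite count for a hypothetical crab: exp of the linear predictor.
newcrab <- data.frame(Width = 26, Dark = "no", GoodSpine = "yes")
predict(preg, newdata = newcrab, type = "response")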

Poisson Regression