• Space of outputs \(\mathcal{Y}\) is finite. Often classes are given numbers starting from \(0\) or \(1\).

  • Usually no notion of “similarity” between class labels in terms of loss. Remember our loss function \(\ell(h(\mathbf{x}),y)\):
    • Regression: \(\ell(9,10)\) is smaller (better) than \(\ell(1,10)\), since the prediction \(9\) is closer to the true value \(10\)
    • Classification: \(\ell(9,10)\) and \(\ell(1,10)\) are equally bad.
      • Or, have explicit losses for every combination of predicted class and actual class.
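The contrast above can be made concrete with a small sketch (in Python for illustration; the functions and the 2-class loss matrix below are hypothetical, not from the slides): squared loss rewards being close, 0–1 loss treats every mistake the same, and an explicit loss matrix assigns a cost to each (predicted, actual) pair.

```python
def squared_loss(pred, actual):
    # Regression: l(9, 10) is much better than l(1, 10).
    return (pred - actual) ** 2

def zero_one_loss(pred, actual):
    # Classification: l(9, 10) and l(1, 10) are equally bad.
    return 0 if pred == actual else 1

# Hypothetical loss matrix L[predicted][actual] for classes {0, 1}:
# here a false negative (predict 0, actual 1) costs 5x a false positive.
loss_matrix = [[0, 5],
               [1, 0]]

print(squared_loss(9, 10), squared_loss(1, 10))    # 1 81
print(zero_one_loss(9, 10), zero_one_loss(1, 10))  # 1 1
print(loss_matrix[0][1], loss_matrix[1][0])        # 5 1
```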

“Linear models” in general (HTF Ch. 2.8.3)

  • By linear models, we mean that the hypothesis function \(h_{\bf w}({\bf x})\) is a (transformed) linear function of the parameters \({\bf w}\).

  • Predictions are a (transformed) linear combination of feature values

\[h_{\bf w}(\mathbf{x}) = g\left(\sum_{k=0}^{p} w_k \phi_k(\mathbf{x})\right) = g(\boldsymbol{\phi}(\mathbf{x})^\mathsf{T}{{\mathbf{w}}})\]

  • again, \(\phi_k\) are called basis functions or feature functions. As usual, we let \(\phi_0(\mathbf{x})=1, \forall \mathbf{x}\), so that we don’t force \(h_{\bf w}(\mathbf{0}) = 0\)
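A minimal Python sketch of \(h_{\bf w}(\mathbf{x}) = g(\boldsymbol{\phi}(\mathbf{x})^\mathsf{T}\mathbf{w})\) for a univariate input; the quadratic basis and the weights below are made up for illustration, and \(g\) is the identity here (it could be, e.g., a sigmoid for classification later on).

```python
def phi(x):
    # phi_0(x) = 1 provides the intercept, so h is not forced through 0.
    return [1.0, x, x ** 2]            # hypothetical basis: 1, x, x^2

def h(w, x, g=lambda z: z):
    # h_w(x) = g( sum_k w_k * phi_k(x) ), a transformed linear
    # combination of the feature values.
    z = sum(wk * pk for wk, pk in zip(w, phi(x)))
    return g(z)

w = [0.5, -1.0, 2.0]                   # hypothetical weights w_0, w_1, w_2
print(h(w, 3.0))                       # 0.5 - 3.0 + 18.0 = 15.5
```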

Linear Methods for Classification

  • Classification tasks

  • Loss functions for classification

  • Logistic Regression

  • Support Vector Machines

Wisconsin Breast Cancer Prognostic Data

Cell samples were taken from tumors in breast cancer patients before surgery and imaged; the tumors were excised; patients were then followed to determine whether or not the cancer recurred, and how long until recurrence or, if not, how long they remained disease-free.

Wisconsin data (continued)

  • 198 instances, 32 features for prediction
  • Outcome (R=recurrence, N=non-recurrence)
  • Time (time until recurrence for R; time remaining healthy for N)

Example: Given nucleus radius, predict cancer recurrence

ggplot(bc,aes(Radius.Mean,fill=Outcome,color=Outcome)) + geom_density(alpha=I(1/2))

Example: Solution by linear regression

  • Univariate real input: nucleus size
  • Output coding: non-recurrence = 0, recurrence = 1
  • Sum squared error minimized by the blue line

Linear regression for classification

  • The predictor shows an increasing trend towards recurrence with larger nucleus size, as expected.

  • Output cannot be directly interpreted as a class prediction.

  • Thresholding output (e.g., at 0.5) could be used to predict 0 or 1.
    (In this case, prediction would be 0 except for extremely large nucleus size.)
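The fit-then-threshold recipe can be sketched as follows (in Python with synthetic data, not the actual Wisconsin set; the class means and counts are made-up stand-ins for "nucleus radius"): code the labels as 0/1, fit by least squares, then threshold the output at 0.5.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical "nucleus radius" values; the minority class (recurrence)
# tends to have larger radius, as in the slides' plot.
x = np.concatenate([rng.normal(14.0, 2.0, 100),   # non-recurrence, y = 0
                    rng.normal(19.0, 2.0, 30)])   # recurrence,     y = 1
y = np.concatenate([np.zeros(100), np.ones(30)])

# Design matrix with a column of ones (phi_0 = 1 gives the intercept).
X = np.column_stack([np.ones_like(x), x])
w, *_ = np.linalg.lstsq(X, y, rcond=None)         # minimize sum squared error

pred = (X @ w >= 0.5).astype(int)                 # threshold output at 0.5
print("slope:", w[1], "accuracy:", (pred == y).mean())
```

Because the 0-coded class dominates, the fitted line only crosses 0.5 at large radii, which is why the thresholded prediction is 0 except for large nucleus sizes.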

Probabilistic view

  • Suppose we have two possible classes: \(y\in \{0,1\}\).

  • The symbols “\(0\)” and “\(1\)” are unimportant. Could have been \(\{a,b\}\), \(\{\mathit{up},\mathit{down}\}\), whatever.

  • Rather than try to predict the class label directly, ask:
    What is the probability that a given input \(\mathbf{x}\) has class \(y=1\)?
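One crude way to see that this probability is a sensible target (a Python sketch on synthetic data, with made-up class distributions): bin the inputs and compute the fraction of class-1 examples per bin, an empirical estimate of \(P(y=1 \mid \mathbf{x} \in \text{bin})\).

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data: class 1 tends to have larger x than class 0.
x = np.concatenate([rng.normal(14.0, 2.0, 500),   # y = 0
                    rng.normal(19.0, 2.0, 500)])  # y = 1
y = np.concatenate([np.zeros(500), np.ones(500)])

edges = np.linspace(x.min(), x.max(), 11)         # 10 equal-width bins
idx = np.clip(np.digitize(x, edges) - 1, 0, 9)    # bin index per point
frac = {b: y[idx == b].mean() for b in range(10) if (idx == b).any()}
for b, p in sorted(frac.items()):
    print(f"bin {b}: P(y=1 | x in bin) ~= {p:.2f}")
```

The estimated probability rises smoothly from near 0 in the low-\(x\) bins to near 1 in the high-\(x\) bins, rather than jumping between hard class labels.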

Aside: Relationships Between Random Variables

Conditional distributions