2018-10-16

## Classification

• Space of outputs $$\mathcal{Y}$$ is finite. Often classes are given numbers starting from $$0$$ or $$1$$.

• Usually no notion of “similarity” between class labels in terms of loss. Recall our loss function $$\ell(h(\mathbf{x}),y)$$:
• Regression: $$\ell(9,10)$$ is smaller (better) than $$\ell(1,10)$$
• Classification: $$\ell(9,10)$$ and $$\ell(1,10)$$ are equally bad (see the sketch below)
• Or, specify an explicit loss for every combination of predicted class and actual class.
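
To make the contrast concrete, here is a minimal R sketch comparing squared loss (natural for regression) with zero-one loss (a common default for classification); the function names are illustrative, not from any particular library.

```r
# Squared loss penalizes by distance; zero-one loss only by mismatch.
squared_loss  <- function(pred, y) (pred - y)^2
zero_one_loss <- function(pred, y) as.numeric(pred != y)

squared_loss(9, 10)    # 1  -- a near miss costs little
squared_loss(1, 10)    # 81 -- a far miss costs a lot
zero_one_loss(9, 10)   # 1  -- any wrong label costs the same
zero_one_loss(1, 10)   # 1
```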

## “Linear models” in general (HTF Ch. 2.8.3)

• By linear models, we mean that the hypothesis function $$h_{\bf w}({\bf x})$$ is a (transformed) linear function of the parameters $${\bf w}$$.

• Predictions are a (transformed) linear combination of feature values

$h_{\bf w}(\mathbf{x}) = g\left(\sum_{k=0}^{p} w_k \phi_k(\mathbf{x})\right) = g(\boldsymbol{\phi}(\mathbf{x})^\mathsf{T}{{\mathbf{w}}})$

• Again, $$\phi_k$$ are called basis functions or feature functions.
• As usual, we let $$\phi_0(\mathbf{x})=1, \forall \mathbf{x}$$, so that we don’t force $$h_{\bf w}(0) = 0$$.
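
A minimal R sketch of this hypothesis class; the quadratic basis and the identity transform below are illustrative assumptions, not choices made in these notes.

```r
# h_w(x) = g(phi(x)^T w) for a scalar input x.
phi <- function(x) c(1, x, x^2)   # phi_0(x) = 1 supplies the intercept
g   <- identity                   # placeholder transform; e.g., plogis later
h_w <- function(w, x) g(sum(phi(x) * w))

h_w(c(0.5, 2, -1), 3)   # 0.5 + 2*3 - 1*3^2 = -2.5
```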

## Linear Methods for Classification

• Loss functions for classification

• Logistic Regression

• Support Vector Machines

## Wisconsin Breast Cancer Prognostic Data

Cell samples were taken from tumors in breast cancer patients before surgery and imaged; the tumors were excised; patients were then followed to determine whether or not the cancer recurred, and either the time to recurrence or the length of the disease-free follow-up.

## Wisconsin data (continued)

• 198 instances, 32 features for prediction
• Outcome (R=recurrence, N=non-recurrence)
• Time (time to recurrence for R; time remaining disease-free for N).

## Example: Given nucleus radius, predict cancer recurrence

```r
library(ggplot2)  # assumes bc holds the Wisconsin data with Radius.Mean, Outcome
ggplot(bc, aes(Radius.Mean, fill = Outcome, color = Outcome)) + geom_density(alpha = 1/2)
```

## Example: Solution by linear regression

• Univariate real input: nucleus size
• Output coding: non-recurrence = 0, recurrence = 1
• The sum of squared errors is minimized by the blue line (a fitting sketch follows below)
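
A minimal R sketch of this fit, assuming the bc data frame and column names from the plotting snippet above:

```r
# Code the outcome as 0/1 and fit by ordinary least squares.
bc$y <- as.numeric(bc$Outcome == "R")   # non-recurrence = 0, recurrence = 1
fit  <- lm(y ~ Radius.Mean, data = bc)  # the fitted line from the figure
coef(fit)                               # intercept and slope
```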

## Linear regression for classification

• The predictor shows an increasing trend towards recurrence with larger nucleus size, as expected.

• Output cannot be directly interpreted as a class prediction.

• Thresholding the output (e.g., at 0.5) could be used to predict 0 or 1, as sketched below.
(In this case, the prediction would be 0 except for extremely large nucleus sizes.)
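
Continuing the hypothetical fit above, thresholding at 0.5 could look like this:

```r
# Convert the continuous regression output into a hard 0/1 prediction.
pred_class <- as.numeric(predict(fit) > 0.5)
table(predicted = pred_class, actual = bc$y)   # simple confusion table
```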

## Probabilistic view

• Suppose we have two possible classes: $$y\in \{0,1\}$$.

• The symbols “$$0$$” and “$$1$$” are unimportant. Could have been $$\{a,b\}$$, $$\{\mathit{up},\mathit{down}\}$$, whatever.

• Rather than try to predict the class label directly, ask:
What is the probability that a given input $$\mathbf{x}$$ has class $$y=1$$?
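
One standard way to produce such a probability, previewed here ahead of the logistic regression section, is to squash a linear score through the logistic (sigmoid) function:

```r
# Map any real-valued score to a probability in (0, 1).
sigmoid <- function(z) 1 / (1 + exp(-z))   # identical to plogis(z)
sigmoid(c(-4, 0, 4))                        # approx. 0.018, 0.500, 0.982
```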