2017-09-19

## Supervised Learning Framework (HTF 2, JWHT 2)

Training experience: a set of labeled examples of the form

$\langle x_1,\,x_2,\,\dots x_p,y\rangle,$

where $$x_j$$ are feature values and $$y$$ is the output

• Task: Given a new $$x_1,\,x_2,\,\dots x_p$$, predict $$y$$

What to learn: A function $$f:\mathcal{X}_1 \times \mathcal{X}_2 \times \cdots \times \mathcal{X}_p \rightarrow \mathcal{Y}$$, which maps the features into the output domain

• Goal: Make accurate future predictions (on unseen data)
• Plan: Learn to make accurate predictions on the training data

## Wisconsin Breast Cancer Prognostic Data

Cell samples were taken from tumors of breast cancer patients before surgery and imaged; the tumors were then excised, and patients were followed up to determine whether the cancer recurred and, if so, how long until recurrence (or how long they remained disease-free).


## Wisconsin data (continued)

• 198 instances, 32 features for prediction
• Outcome (R=recurrence, N=non-recurrence)
• Time (time until recurrence for R; time remained healthy for N)

| x1 | x2 | x3 | Outcome | Time |
|------:|------:|-------:|:-------:|-----:|
| 18.02 | 27.60 | 117.50 | N | 31 |
| 17.99 | 10.38 | 122.80 | N | 61 |
| 21.37 | 17.44 | 137.50 | N | 116 |
| 11.42 | 20.38 | 77.58 | N | 123 |
| 20.29 | 14.34 | 135.10 | R | 27 |
| 12.75 | 15.29 | 84.60 | R | 77 |

## Terminology

| x1 | x2 | x3 | Outcome | Time |
|------:|------:|-------:|:-------:|-----:|
| 18.02 | 27.60 | 117.50 | N | 31 |
| 17.99 | 10.38 | 122.80 | N | 61 |
| 21.37 | 17.44 | 137.50 | N | 116 |
| 11.42 | 20.38 | 77.58 | N | 123 |
| 20.29 | 14.34 | 135.10 | R | 27 |
| 12.75 | 15.29 | 84.60 | R | 77 |
• Columns are called input variables or features or attributes
• The outcome and time (which we are trying to predict) are called labels or output variables or targets
• A row in the table is called a training example or an instance
• The whole table is called the (training) data set.

## Prediction problems

| x1 | x2 | x3 | Outcome | Time |
|------:|------:|-------:|:-------:|-----:|
| 18.02 | 27.60 | 117.50 | N | 31 |
| 17.99 | 10.38 | 122.80 | N | 61 |
| 21.37 | 17.44 | 137.50 | N | 116 |
| 11.42 | 20.38 | 77.58 | N | 123 |
| 20.29 | 14.34 | 135.10 | R | 27 |
| 12.75 | 15.29 | 84.60 | R | 77 |

• The problem of predicting the recurrence is called (binary) classification
• The problem of predicting the time is called regression
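To make the two problem types concrete, here is a minimal R sketch that puts the six rows shown above into a data frame; the column names `x1`–`x3` are placeholders, since the excerpt does not name its feature columns.

```r
bc6 <- data.frame(
  x1      = c(18.02, 17.99, 21.37, 11.42, 20.29, 12.75),
  x2      = c(27.60, 10.38, 17.44, 20.38, 14.34, 15.29),
  x3      = c(117.50, 122.80, 137.50, 77.58, 135.10, 84.60),
  Outcome = factor(c("N", "N", "N", "N", "R", "R")),  # discrete label: classification target
  Time    = c(31, 61, 116, 123, 27, 77)               # real-valued label: regression target
)
str(bc6)
```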

## More formally

• The $$i$$th training example has the form: $$\langle x_{1,i}, \dots x_{p,i}, y_i\rangle$$ where $$p$$ is the number of features (32 in our case).

• Notation $${\bf x}_i$$ denotes a column vector with elements $$x_{1,i},\dots x_{p,i}$$.

• The training set $$D$$ consists of $$n$$ training examples

• We denote the $$n\times p$$ matrix of features by $$X$$ and the size-$$n$$ column vector of outputs from the data set by $${\mathbf{y}}$$.

• In statistics, $$X$$ is called the data matrix or the design matrix.

• $${{\cal X}}$$ denotes space of input values

• $${{\cal Y}}$$ denotes space of output values
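Continuing the small sketch from above (the hypothetical `bc6` data frame), the corresponding $$X$$ and $${\mathbf{y}}$$ for predicting Time would be:

```r
X <- as.matrix(bc6[, c("x1", "x2", "x3")])   # n x p matrix of features
y <- bc6$Time                                # length-n vector of outputs
dim(X)       # n = 6 examples, p = 3 features in this excerpt
length(y)
```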

## Supervised learning problem

• Given a data set $$D \subset ({{\cal X}}\times {{\cal Y}})^n$$, find a function: $h : {{\cal X}}\rightarrow {{\cal Y}}$ such that $$h({\bf x})$$ is a “good predictor” for the value of $$y$$.

• $$h$$ is called a predictive model or hypothesis

• Problems are categorized by the type of output domain

• If $${{\cal Y}}=\mathbb{R}$$, this problem is called regression

• If $${{\cal Y}}$$ is a finite discrete set, the problem is called classification

• If $${{\cal Y}}$$ has 2 elements, the problem is called binary classification

## Steps to solving a supervised learning problem

1. Decide what the input-output pairs are.

2. Decide how to encode inputs and outputs.

This defines the input space $${{\cal X}}$$, and the output space $${{\cal Y}}$$.

(We will discuss this in detail later)

3. Choose model space/hypothesis class $${{\cal H}}$$ .

## Linear hypothesis (HTF 3, JWHT 3)

• Suppose $$y$$ is a linear function of $${\bf x}$$: $h_{\bf w}({\bf x}) = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_p x_p$

• $$w_i$$ are called parameters or weights (often $$\beta_i$$ in stats books)

• Typically include an attribute $$x_0=1$$ (also called bias term or intercept term) so that the number of weights is $$p+1$$. We then write: $h_{\bf w}({\bf x}) = \sum_{i=0}^p w_i x_i = {\mathbf{x}}^{\mathsf{T}}{\mathbf{w}}$ where $${\bf w}$$ and $${\bf x}$$ are column vectors of size $$p+1$$.

• The design matrix $$X$$ is now $$n$$ by $$p+1$$.
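A tiny R sketch of the bias column and the prediction $${\mathbf{x}}^{\mathsf{T}}{\mathbf{w}}$$, using a few values that reappear in the examples below (the weights are the $$w_0=0.9, w_1=-0.4$$ used later):

```r
x1 <- c(0.86, 0.09, -0.85)    # one feature, three examples
X  <- cbind(x0 = 1, x1)       # prepend the bias column x0 = 1; X is n x (p+1)
w  <- c(0.9, -0.4)            # weights (w0, w1)
X %*% w                       # h_w(x_i) = w0 + w1 * x1_i for every row at once
```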

## Example: Design matrix with bias term

| x0 | x1 | y |
|---:|------:|------:|
| 1 | 0.86 | 2.49 |
| 1 | 0.09 | 0.83 |
| 1 | -0.85 | -0.25 |
| 1 | 0.87 | 3.10 |
| 1 | -0.44 | 0.87 |
| 1 | -0.43 | 0.02 |
| 1 | -1.10 | -0.12 |
| 1 | 0.40 | 1.81 |
| 1 | -0.96 | -0.83 |
| 1 | 0.17 | 0.43 |

Models will be of the form

$h_{\mathbf{w}}({\mathbf{x}}) = x_0 w_0 + x_1 w_1 = w_0 + x_1 w_1$

How should we pick $${\mathbf{w}}$$?

## Error minimization

• Intuitively, $$\bf w$$ should make the predictions of $$h_{\bf w}$$ close to the true values $$y_i$$ on the training data

• Define an error function or cost function to measure how much our prediction differs from the “true” answer on the training data

• Pick $$\bf w$$ such that the error function is minimized

• Hopefully, new examples are somehow “similar” to the training examples, and will also have small error.

How should we choose the error function?

## Least mean squares (LMS)

• Main idea: try to make $$h_{\bf w}({\bf x})$$ close to $$y$$ on the examples in the training set

• We define a sum-of-squares error function $J({\bf w}) = \frac{1}{2}\sum_{i=1}^n (h_{\bf w}({\bf x}_i)-y_i)^2$ (the $$1/2$$ is just for convenience)

• We will choose $$\bf w$$ so as to minimize $$J({\bf w})$$

• One way to do it: compute $$\bf w$$ such that: $\frac{\partial}{\partial w_j}J({\bf w}) = 0,\,\, \forall j=0\dots p$
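A direct transcription of the error function into R (a sketch; `X` is the augmented design matrix and `y` the vector of outputs, as in the earlier sketches):

```r
# J(w) = 1/2 * sum_i (h_w(x_i) - y_i)^2, with h_w(x) = x^T w
J <- function(w, X, y) {
  r <- X %*% w - y      # residuals h_w(x_i) - y_i
  0.5 * sum(r^2)
}
```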

## Example: $$w_0=0.9,w_1=-0.4$$

SSE: 21.510

## Example: least-squares fit

```r
mod <- lm(y ~ x1, data=exb); print(mod$coefficients)
## (Intercept)          x1
##    1.058813    1.610168
```

SSE: 2.240

## Solving a supervised learning problem: optimisation-based approach

1. Decide what the input-output pairs are.

2. Decide how to encode inputs and outputs. This defines the input space $${{\cal X}}$$, and the output space $${{\cal Y}}$$.

3. Choose a class of models/hypotheses $${{\cal H}}$$.

4. Choose an error function (cost function) to define the best model in the class.

5. Choose an algorithm for searching efficiently through the space of models to find the best.

## Recurrence Time from Tumor Radius

```r
mod <- lm(Time ~ Radius.Mean, data=bc %>% filter(Outcome == 'R')); print(mod$coefficients)
## (Intercept) Radius.Mean
##   83.161238   -3.156896
```
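As a usage sketch (assuming the `mod` and `bc` objects from the chunk above), the fitted coefficients predict recurrence time as $$83.16 - 3.16 \times \text{radius}$$; for a hypothetical tumor with mean radius 20:

```r
predict(mod, newdata = data.frame(Radius.Mean = 20))
# equivalently: 83.161238 - 3.156896 * 20, about 20.02
```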

## Notation reminder

• Consider a function $$J(u_1,u_2,\ldots,u_p):\mathbb{R}^p\rightarrow\mathbb{R}$$ (for us, this will usually be an error function)

• The gradient $$\nabla J(u_1,u_2,\ldots,u_p):\mathbb{R}^p\rightarrow\mathbb{R}^p$$ is a function which outputs a vector containing the partial derivatives.
That is: $\nabla J = \left\langle{\frac{\partial}{\partial u_1}}J,{\frac{\partial}{\partial u_2}}J,\ldots,{\frac{\partial}{\partial u_p}}J\right\rangle$

• If $$J$$ is differentiable and convex, we can find the global minimum of $$J$$ by solving $$\nabla J = \mathbf{0}$$.

• The partial derivative is the derivative along the $$u_i$$ axis, keeping all other variables fixed.
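A small made-up example: for $$J(u_1,u_2)=u_1^2+3u_1u_2+5u_2^2$$ we have

$\nabla J = \left\langle 2u_1+3u_2,\; 3u_1+10u_2\right\rangle,$

and since this $$J$$ is convex, the unique solution of $$\nabla J=\mathbf{0}$$, namely $$(0,0)$$, is its global minimum.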

## The Least Squares Solution (HTF 2.6, 3.2, JWHT 3.1)

• Recalling some multivariate calculus (we drop the constant factor $$1/2$$ from $$J$$, since it does not affect the minimizer): \begin{aligned} \nabla_{\mathbf{w}}J & = & \nabla_{\mathbf{w}}(X{\mathbf{w}}-{{\mathbf{y}}})^{{\mathsf{T}}}(X{\mathbf{w}}-{{\mathbf{y}}}{}) \\ & = & \nabla_{\mathbf{w}}({\mathbf{w}}^{\mathsf{T}}X^{\mathsf{T}}-{{\mathbf{y}}^{\mathsf{T}}})(X{\mathbf{w}}-{{\mathbf{y}}}{}) \\ & = & \nabla_{\mathbf{w}}({\mathbf{w}}^{{\mathsf{T}}}X^{{\mathsf{T}}}X{\mathbf{w}}-{{\mathbf{y}}}^{{\mathsf{T}}}X{\mathbf{w}}-{\mathbf{w}}^{{\mathsf{T}}}X^{{\mathsf{T}}}{{\mathbf{y}}}{}+{{\mathbf{y}}}{}^{{\mathsf{T}}}{{\mathbf{y}}}{}) \\ & = & \nabla_{\mathbf{w}}({\mathbf{w}}^{{\mathsf{T}}}X^{{\mathsf{T}}}X{\mathbf{w}}-2{{\mathbf{y}}}^{{\mathsf{T}}}X{\mathbf{w}}+{{\mathbf{y}}}{}^{{\mathsf{T}}}{{\mathbf{y}}}{}) \\ & = & 2 X^{{\mathsf{T}}}X {\mathbf{w}}- 2 X^{{\mathsf{T}}}{{\mathbf{y}}} \end{aligned}

• Setting gradient equal to zero: \begin{aligned} 2 X^{{\mathsf{T}}}X {\mathbf{w}}- 2 X^{{\mathsf{T}}}{{\mathbf{y}}}{} & = & 0 \\ \Rightarrow X^{{\mathsf{T}}}X {\mathbf{w}}& = & X^{{\mathsf{T}}}{{\mathbf{y}}}{} \\ \Rightarrow {\mathbf{w}}= (X^{{\mathsf{T}}}X)^{-1}X^{{\mathsf{T}}}{{\mathbf{y}}}{}\end{aligned}

• The inverse exists if the columns of $$X$$ are linearly independent.
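As a sanity check, a short R sketch of the closed-form solution on the ten-row example data used in the following slides; the data frame is rebuilt by hand and named `exb` to match the `lm` call shown earlier.

```r
# Example data from the design-matrix slides
exb <- data.frame(
  x1 = c(0.86, 0.09, -0.85, 0.87, -0.44, -0.43, -1.10, 0.40, -0.96, 0.17),
  y  = c(2.49, 0.83, -0.25, 3.10, 0.87, 0.02, -0.12, 1.81, -0.83, 0.43)
)
X <- cbind(1, exb$x1)                     # augment with the bias column x0 = 1
w <- solve(t(X) %*% X, t(X) %*% exb$y)    # solve the normal equations (X^T X) w = X^T y
w                                         # approximately (1.06, 1.61)
coef(lm(y ~ x1, data = exb))              # the same fit via lm
```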

## Example of linear regression

| x0 | x1 | y |
|---:|------:|------:|
| 1 | 0.86 | 2.49 |
| 1 | 0.09 | 0.83 |
| 1 | -0.85 | -0.25 |
| 1 | 0.87 | 3.10 |
| 1 | -0.44 | 0.87 |
| 1 | -0.43 | 0.02 |
| 1 | -1.10 | -0.12 |
| 1 | 0.40 | 1.81 |
| 1 | -0.96 | -0.83 |
| 1 | 0.17 | 0.43 |

$$h_{\bf w} ({\bf x}) = 1.06 + 1.61 x_1$$

## Data matrices

$X=\left[\begin{array}{rr} 1 & 0.86 \\ 1 & 0.09 \\ 1 & -0.85 \\ 1 & 0.87 \\ 1 & -0.44 \\ 1 & -0.43 \\ 1 & -1.10 \\ 1 & 0.40 \\ 1 & -0.96 \\ 1 & 0.17 \end{array}\right]~~~~~{{\mathbf{y}}}{}=\left[\begin{array}{r} 2.49 \\ 0.83 \\ -0.25 \\ 3.10 \\ 0.87 \\ 0.02 \\ -0.12 \\ 1.81 \\ -0.83 \\ 0.43 \end{array}\right]$

## $$X^{{\mathsf{T}}}X$$

$X^{{\mathsf{T}}}X =$ ${\tiny \left[\begin{array}{rrrrrrrrrr} 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 0.86 & 0.09 & -0.85 & 0.87 & -0.44 & -0.43 & -1.10 & 0.40 & -0.96 & 0.17 \end{array}\right]\times\left[\begin{array}{cc} 1 & 0.86 \\ 1 & 0.09 \\ 1 & -0.85 \\ 1 & 0.87 \\ 1 & -0.44 \\ 1 & -0.43 \\ 1 & -1.10 \\ 1 & 0.40 \\ 1 & -0.96 \\ 1 & 0.17 \end{array}\right]}$ $=\left[\begin{array}{rr} 10 & -1.39 \\ -1.39 & 4.95 \end{array}\right]$

## $$X^{{\mathsf{T}}}{{\mathbf{y}}}{}$$

$X^{{\mathsf{T}}}{{\mathbf{y}}}{}=$ ${\tiny \left[\begin{array}{rrrrrrrrrr} 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 0.86 & 0.09 & -0.85 & 0.87 & -0.44 & -0.43 & -1.10 & 0.40 & -0.96 & 0.17 \end{array}\right]\times\left[\begin{array}{c} 2.49 \\ 0.83 \\ -0.25 \\ 3.10 \\ 0.87 \\ 0.02 \\ -0.12 \\ 1.81 \\ -0.83 \\ 0.43 \end{array}\right]}$ $=\left[\begin{array}{r} 8.34 \\ 6.49 \end{array}\right]$

## Solving for $${\bf w}$$

${\bf w}=(X^{{\mathsf{T}}}X)^{-1}X^{{\mathsf{T}}}{{\mathbf{y}}}{} = \left[\begin{array}{rr} 10 & -1.39 \\ -1.39 & 4.95 \end{array}\right]^{-1}\left[\begin{array}{r} 8.34 \\ 6.49 \end{array}\right] = \left[\begin{array}{r} 1.06 \\ 1.61 \end{array}\right]$

So the best fit line is $$y=1.06 + 1.61x$$.
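For a $$2\times 2$$ system the inverse can be written out explicitly, which makes the arithmetic easy to check by hand:

$\left[\begin{array}{rr} 10 & -1.39 \\ -1.39 & 4.95 \end{array}\right]^{-1} = \frac{1}{10\cdot 4.95-(-1.39)^2}\left[\begin{array}{rr} 4.95 & 1.39 \\ 1.39 & 10 \end{array}\right] = \frac{1}{47.57}\left[\begin{array}{rr} 4.95 & 1.39 \\ 1.39 & 10 \end{array}\right],$

so

${\bf w} = \frac{1}{47.57}\left[\begin{array}{rr} 4.95 & 1.39 \\ 1.39 & 10 \end{array}\right]\left[\begin{array}{r} 8.34 \\ 6.49 \end{array}\right] = \frac{1}{47.57}\left[\begin{array}{r} 50.30 \\ 76.49 \end{array}\right] \approx \left[\begin{array}{r} 1.06 \\ 1.61 \end{array}\right].$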

## Linear regression summary

• The optimal solution (minimizing sum-squared-error) can be computed in polynomial time in the size of the data set.

• The solution is $${\bf w}=(X^{{\mathsf{T}}}X)^{-1}X^{{\mathsf{T}}}{{\mathbf{y}}}{}$$, where $$X$$ is the data matrix augmented with a column of ones, and $${{\mathbf{y}}}{}$$ is the column vector of target outputs.

• A very rare case in which an analytical, exact solution is possible

## Is linear regression enough?

• Linear regression should be the first thing you try for real-valued outputs!

• …but it is sometimes not expressive enough.

• Two possible solutions:

1. Explicitly transform the data, i.e. create additional features

• More generally, apply a transformation of the inputs from $${\cal X}$$ to some other space $${\cal X}'$$, then do linear regression in the transformed space

2. Use a different model space/hypothesis class

• Ideas (1) and (2) are two views of the same strategy. Today we focus on the first approach.

## Polynomial fits (HTF 2.6, JWHT 7.1)

• Suppose we want to fit a higher-degree polynomial to the data.
(E.g., $$y=w_0 + w_1x_1+w_2 x_1^2$$.)

• Suppose for now that there is a single input variable $$x_{1,i}$$ per training example.

• How do we do it?

• Given data: $$(x_{1,1},y_1), (x_{1,2},y_2), \ldots, (x_{1,n},y_n)$$.

• Suppose we want a degree-$$d$$ polynomial fit.

• Let $${{\mathbf{y}}}{}$$ be as before and let $X=\left[\begin{array}{rrrrr} 1 & x_{1,1} & x_{1,1}^2 & \ldots & x_{1,1}^d \\ 1 & x_{1,2} & x_{1,2}^2 & \ldots & x_{1,2}^d \\ \vdots & & \vdots & \vdots & \vdots \\ 1 & x_{1,n} & x_{1,n}^2 & \ldots & x_{1,n}^d \\ \end{array}\right]$

• We are making up features to add to our design matrix

• Solve the linear regression $$X{\bf w}\approx {{\mathbf{y}}}{}$$.
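In R this is just another linear regression on an augmented design matrix; a sketch, reusing the `exb` data frame built in the earlier closed-form sketch:

```r
Xq <- cbind(1, exb$x1, exb$x1^2)          # degree-2 design matrix: 1, x, x^2
solve(t(Xq) %*% Xq, t(Xq) %*% exb$y)      # normal equations, exactly as before
coef(lm(y ~ x1 + I(x1^2), data = exb))    # same fit via the lm formula interface
# both give roughly (0.74, 1.75, 0.69), matching the worked example below
```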

## Example of quadratic regression: Data matrices

$X=\left[\begin{array}{rrr} 1 & 0.86 & 0.75 \\ 1 & 0.09 & 0.01 \\ 1 & -0.85 & 0.73 \\ 1 & 0.87 & 0.76 \\ 1 & -0.44 & 0.19 \\ 1 & -0.43 & 0.18 \\ 1 & -1.10 & 1.22 \\ 1 & 0.40 & 0.16 \\ 1 & -0.96 & 0.93 \\ 1 & 0.17 & 0.03 \end{array}\right]~~~~~{{\mathbf{y}}}{}=\left[\begin{array}{r} 2.49 \\ 0.83 \\ -0.25 \\ 3.10 \\ 0.87 \\ 0.02 \\ -0.12 \\ 1.81 \\ -0.83 \\ 0.43 \end{array}\right]$

## $$X^{{\mathsf{T}}}X$$

$X^{{\mathsf{T}}}X =$ ${\tiny \hspace{-0.3in}\left[\begin{array}{rrrrrrrrrr} 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 0.86 & 0.09 & -0.85 & 0.87 & -0.44 & -0.43 & -1.10 & 0.40 & -0.96 & 0.17 \\ 0.75 & 0.01 & 0.73 & 0.76 & 0.19 & 0.18 & 1.22 & 0.16 & 0.93 & 0.03 \end{array}\right]\times\left[\begin{array}{rrr} 1 & 0.86 & 0.75 \\ 1 & 0.09 & 0.01 \\ 1 & -0.85 & 0.73 \\ 1 & 0.87 & 0.76 \\ 1 & -0.44 & 0.19 \\ 1 & -0.43 & 0.18 \\ 1 & -1.10 & 1.22 \\ 1 & 0.40 & 0.16 \\ 1 & -0.96 & 0.93 \\ 1 & 0.17 & 0.03 \end{array}\right]}$ $=\left[\begin{array}{rrr} 10 & -1.39 & 4.95 \\ -1.39 & 4.95 & -1.64 \\ 4.95 & -1.64 & 4.11 \end{array}\right]$

## $$X^{{\mathsf{T}}}{{\mathbf{y}}}{}$$

$X^{{\mathsf{T}}}{{\mathbf{y}}}{}=$ ${\tiny \hspace{-0.3in}\left[\begin{array}{rrrrrrrrrr} 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 0.86 & 0.09 & -0.85 & 0.87 & -0.44 & -0.43 & -1.10 & 0.40 & -0.96 & 0.17 \\ 0.75 & 0.01 & 0.73 & 0.76 & 0.19 & 0.18 & 1.22 & 0.16 & 0.93 & 0.03\\ \end{array}\right]\times\left[\begin{array}{r} 2.49 \\ 0.83 \\ -0.25 \\ 3.10 \\ 0.87 \\ 0.02 \\ -0.12 \\ 1.81 \\ -0.83 \\ 0.43 \end{array}\right]}$ $=\left[\begin{array}{r} 8.34 \\ 6.49 \\ 3.60 \end{array}\right]$

## Solving for $${\bf w}$$

${\bf w}=(X^{{\mathsf{T}}}X)^{-1}X^{{\mathsf{T}}}{{\mathbf{y}}}{} = {\tiny \left[\begin{array}{rrr} 10 & -1.39 & 4.95 \\ -1.39 & 4.95 & -1.64 \\ 4.95 & -1.64 & 4.11 \\ \end{array}\right]^{-1}\left[\begin{array}{r} 8.34 \\ 6.49 \\ 3.60 \end{array}\right] = \left[\begin{array}{r} 0.74 \\ 1.75 \\ 0.69 \end{array}\right]}$

So the best order-2 polynomial is $$y=0.74 + 1.75 x + 0.69x^2$$.

## Data and linear fit

(Plot of the example data with the fitted line; fitted coefficients below.)

```r
## (Intercept)           x
##         1.1         1.6
```