2016-02-09

Project

  • I have a secret… your project might not work.
  • That is okay. Prove to me and to your classmates that:
    • You thoroughly understand the substantive area and problem
    • You thoroughly understand the data
    • You know what methods are reasonable to try and why
    • You tried several and evaluated them rigorously, but your predictions are just not that good.
  • You can’t get blood from a turnip. (But prove it.)

Relationships between variables

  • Random vectors or vector-valued random variables.
  • Variables that occur together in some meaningful sense.

Joint distribution


library(knitr);
kable(faithful)
eruptions waiting
3.600 79
1.800 54
3.333 74
2.283 62
4.533 85
2.883 55
4.700 88
3.600 85
1.950 51
4.350 85
… (262 more rows; faithful records 272 Old Faithful eruptions: eruption duration and waiting time to the next eruption, both in minutes)

Correlation (JWHT 2.3, 3.1.3)


Pearson Correlation





\[ \rho_{X,Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X\sigma_Y} \]

Pearson Correlation: "Plugin" Estimate





\[ r_{X,Y} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2}\sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}} \]
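
A direct implementation of this plugin estimate in R (a sketch, not from the slides):

# Plugin estimate of the Pearson correlation
r_xy <- function(x, y) {
  sum((x - mean(x)) * (y - mean(y))) /
    (sqrt(sum((x - mean(x))^2)) * sqrt(sum((y - mean(y))^2)))
}
r_xy(faithful$eruptions, faithful$waiting)  # ~0.9008, matching cor() on the next slide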

Sample Correlation
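
This is presumably the output of R's cor() applied to the faithful data (the generating code is not shown on the slide):

cor(faithful)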

##           eruptions   waiting
## eruptions 1.0000000 0.9008112
## waiting   0.9008112 1.0000000

Correlation Gotchas


Joint distribution - Density


Marginal distributions - Densities


Marginal distributions - Rug Plot


Conditional distributions


Independence


  • Two random variables \(X\) and \(Y\) that are part of a random vector are independent iff:

\[ F_{X,Y}(x,y) = F_X(x)F_Y(y) \]

  • Consequences:

\[ \Pr(X=x|Y=y) = \Pr(X=x) \]

\[ \Pr(Y=y|X=x) = \Pr(Y=y) \]

  • Any conditional distribution of \(X\) is the same as the marginal distribution of \(X\)
  • Knowing about \(Y\) provides no information about \(X\).

Independence vs. Correlation

##          x        y
## x 1.000000 0.703538
## y 0.703538 1.000000

Independence vs. Correlation

##             x           y
## x  1.00000000 -0.01458965
## y -0.01458965  1.00000000

Independence vs. Correlation

##             x           y
## x 1.000000000 0.003657629
## y 0.003657629 1.000000000

Independence vs. Correlation

##             x           y
## x  1.00000000 -0.01608683
## y -0.01608683  1.00000000
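
The scatterplots for these slides are not reproduced here, but the phenomenon is easy to recreate; a sketch (my own construction, not the slides' data) of variables that are strongly dependent yet nearly uncorrelated:

set.seed(42)
x <- runif(1000, -1, 1)
y <- x^2 + rnorm(1000, sd = 0.1)  # y is a deterministic function of x plus noise...
cor(x, y)                         # ...yet the Pearson correlation is near zero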

Predicting Waiting Time


## Mean: 70.90

Conditional predictions

  • If I know eruption time, can I do better?

## Mean: 55.60

Conditional predictions

  • If I know eruption time, can I do better?

## Mean: 81.33

Conditional predictions?

  • If I know eruption time, can I do better?
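
These conditional means are easy to compute by splitting on eruption time (a sketch assuming a cutoff of 3 minutes; the slides' actual threshold is not shown):

mean(faithful$waiting)                           # marginal mean, ~70.90
mean(faithful$waiting[faithful$eruptions < 3])   # conditional mean, short eruptions
mean(faithful$waiting[faithful$eruptions >= 3])  # conditional mean, long eruptions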

Supervised Learning Framework (HTF 2, JWHT 2)

Training experience: a set of labeled examples of the form

\[\langle x_1,\,x_2,\,\dots x_p,y\rangle,\]

where \(x_j\) are feature values and \(y\) is the output

  • Task: Given a new \(x_1,\,x_2,\,\dots x_p\), predict \(y\)

What to learn: A function \(f:\mathcal{X}_1 \times \mathcal{X}_2 \times \cdots \times \mathcal{X}_p \rightarrow \mathcal{Y}\), which maps the features into the output domain

  • Goal: Make accurate future predictions (on unseen data)
  • Plan: Learn to make accurate predictions on the training data

Wisconsin Breast Cancer Prognostic Data

Cell samples were taken from breast tumors before surgery and imaged; the tumors were then excised, and patients were followed to determine whether the cancer recurred and, if so, how long until recurrence (or, if not, how long they remained disease-free).

Wisconsin data (continued)

  • 198 instances, 32 features for prediction
  • Outcome (R=recurrence, N=non-recurrence)
  • Time (time until recurrence for R; time disease-free for N).
Radius.Mean Texture.Mean Perimeter.Mean Outcome Time
18.02 27.60 117.50 N 31
17.99 10.38 122.80 N 61
21.37 17.44 137.50 N 116
11.42 20.38 77.58 N 123
20.29 14.34 135.10 R 27
12.75 15.29 84.60 R 77

Terminology

Radius.Mean Texture.Mean Perimeter.Mean Outcome Time
18.02 27.60 117.50 N 31
17.99 10.38 122.80 N 61
21.37 17.44 137.50 N 116
11.42 20.38 77.58 N 123
20.29 14.34 135.10 R 27
12.75 15.29 84.60 R 77


  • Columns are called input variables or features or attributes
  • The outcome and time (which we are trying to predict) are called labels or output variables or targets
  • A row in the table is called a training example or instance
  • The whole table is called the (training) data set.

Prediction problems

Radius.Mean Texture.Mean Perimeter.Mean Outcome Time
18.02 27.60 117.50 N 31
17.99 10.38 122.80 N 61
21.37 17.44 137.50 N 116
11.42 20.38 77.58 N 123
20.29 14.34 135.10 R 27
12.75 15.29 84.60 R 77


  • The problem of predicting the recurrence is called (binary) classification
  • The problem of predicting the time is called regression

More formally

  • The \(i\)th training example has the form: \(\langle x_{1,i}, \dots x_{p,i}, y_i\rangle\) where \(p\) is the number of features (32 in our case).

  • Notation \({\bf x}_i\) denotes a column vector with elements \(x_{1,i},\dots x_{p,i}\).

  • The training set \(D\) consists of \(n\) training examples

  • We denote the \(n\times p\) matrix of features by \(X\) and the size-\(n\) column vector of outputs from the data set by \({\mathbf{y}}\).

  • In statistics, \(X\) is called the data matrix or the design matrix.

  • \({{\cal X}}\) denotes space of input values

  • \({{\cal Y}}\) denotes space of output values

Supervised learning problem

  • Given a data set \(D \in ({{\cal X}}\times {{\cal Y}})^n\), find a function: \[h : {{\cal X}}\rightarrow {{\cal Y}}\] such that \(h({\bf x})\) is a “good predictor” for the value of \(y\).

  • \(h\) is called a predictive model or hypothesis

  • Problems are categorized by the type of output domain

    • If \({{\cal Y}}=\mathbb{R}\), this problem is called regression

    • If \({{\cal Y}}\) is a finite discrete set, the problem is called classification

    • If \({{\cal Y}}\) has 2 elements, the problem is called binary classification

Steps to solving a supervised learning problem

  1. Decide what the input-output pairs are.

  2. Decide how to encode inputs and outputs.

    This defines the input space \({{\cal X}}\), and the output space \({{\cal Y}}\).

    (We will discuss this in detail later)

  3. Choose model space/hypothesis class \({{\cal H}}\).

Example: Choosing a model space

Linear hypothesis (HTF 3, JWHT 3)

  • Suppose \(y\) is a linear function of \({\bf x}\): \[h_{\bf w}({\bf x}) = w_0 + w_1 x_1 + w_2 x_2 + \cdots\]

  • \(w_i\) are called parameters or weights (often \(\beta_i\) in stats books)

  • Typically include an attribute \(x_0=1\) (also called bias term or intercept term) so that the number of weights is \(p+1\). We then write: \[h_{\bf w}({\bf x}) = \sum_{i=0}^p w_i x_i = {\mathbf{x}}^{\mathsf{T}}{\mathbf{w}}\] where \({\bf w}\) and \({\bf x}\) are column vectors of size \(p+1\).

  • The design matrix \(X\) is now \(n\) by \(p+1\).

Example: Design matrix with bias term

x0 x1 y
1 0.86 2.49
1 0.09 0.83
1 -0.85 -0.25
1 0.87 3.10
1 -0.44 0.87
1 -0.43 0.02
1 -1.10 -0.12
1 0.40 1.81
1 -0.96 -0.83
1 0.17 0.43
Models will be of the form \[ \begin{aligned} h_{\mathbf{w}}({\mathbf{x}}) & = x_0 w_0 + x_1 w_1\\ & = w_0 + x_1 w_1 \end{aligned} \] How should we pick \({\mathbf{w}}\)?

Error minimization

  • Intuitively, \(\bf w\) should make the predictions of \(h_{\bf w}\) close to the true values \(y_i\) on the training data

  • Define an error function or cost function to measure how much our prediction differs from the “true” answer on the training data

  • Pick \(\bf w\) such that the error function is minimized

  • Hopefully, new examples are somehow “similar” to the training examples, and will also have small error.

How should we choose the error function?

Least mean squares (LMS)

  • Main idea: try to make \(h_{\bf w}({\bf x})\) close to \(y\) on the examples in the training set

  • We define a sum-of-squares error function \[J({\bf w}) = \frac{1}{2}\sum_{i=1}^n (h_{\bf w}({\bf x}_i)-y_i)^2\] (the \(1/2\) is just for convenience)

  • We will choose \(\bf w\) such as to minimize \(J(\bf w)\)

  • One way to do it: compute \(\bf w\) such that: \[\frac{\partial}{\partial w_j}J({\bf w}) = 0,\,\, \forall j=0\dots p\]
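
As a quick sketch, \(J\) can be evaluated for any candidate \(\bf w\) (assuming the example data frame exb with columns x1 and y used in the lm() call later):

sse <- function(w0, w1, x, y) sum((w0 + w1 * x - y)^2)  # sum of squared errors
sse(0.9, -0.4, exb$x1, exb$y)  # the next slide reports SSE = 21.510 for this w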

Example: \(w_0=0.9,w_1=-0.4\)

## SSE: 21.510

OLS Fit to Example Data

mod <- lm(y ~ x1, data=exb); print(mod$coefficients)
## (Intercept)          x1 
##    1.058813    1.610168
## SSE: 2.240

Solving a supervised learning problem

  1. Decide what the input-output pairs are.

  2. Decide how to encode inputs and outputs.

    This defines the input space \({{\cal X}}\), and the output space \({{\cal Y}}\).

  3. Choose a class of models/hypotheses \({{\cal H}}\).

  4. Choose an error function (cost function) to define the best model in the class

  5. Choose an algorithm for searching efficiently through the space of models to find the best.

Recurrence Time from Tumor Radius

library(dplyr)  # provides %>% and filter()
mod <- lm(Time ~ Radius.Mean, data=bc %>% filter(Outcome == 'R')); print(mod$coefficients)
## (Intercept) Radius.Mean 
##   83.161238   -3.156896

Notation reminder

  • Consider a function \(J(u_1,u_2,\ldots,u_p):\mathbb{R}^p\rightarrow\mathbb{R}\) (for us, this will usually be an error function)

  • The gradient \(\nabla J(u_1,u_2,\ldots,u_p):\mathbb{R}^p\rightarrow\mathbb{R}^p\) is a function which outputs a vector containing the partial derivatives.
    That is: \[\nabla J = \left\langle{\frac{\partial}{\partial u_1}}J,{\frac{\partial}{\partial u_2}}J,\ldots,{\frac{\partial}{\partial u_p}}J\right\rangle\]

  • If \(J\) is differentiable and convex, we can find the global minimum of \(J\) by solving \(\nabla J = \mathbf{0}\).

  • The partial derivative is the derivative along the \(u_i\) axis, keeping all other variables fixed.

The Least Squares Solution (HTF 2.6, 3.2, JWHT 3.1)

  • Recalling some multivariate calculus (we minimize \((X{\mathbf{w}}-{\mathbf{y}})^{\mathsf{T}}(X{\mathbf{w}}-{\mathbf{y}})\), dropping the \(1/2\) since it does not change the minimizer): \[\begin{aligned} \nabla_{\mathbf{w}}J & = & \nabla_{\mathbf{w}}(X{\mathbf{w}}-{{\mathbf{y}}})^{{\mathsf{T}}}(X{\mathbf{w}}-{{\mathbf{y}}}{}) \\ & = & \nabla_{\mathbf{w}}({\mathbf{w}}^{\mathsf{T}}X^{\mathsf{T}}-{{\mathbf{y}}^{\mathsf{T}}})(X{\mathbf{w}}-{{\mathbf{y}}}{}) \\ & = & \nabla_{\mathbf{w}}({\mathbf{w}}^{{\mathsf{T}}}X^{{\mathsf{T}}}X{\mathbf{w}}-{{\mathbf{y}}}^{{\mathsf{T}}}X{\mathbf{w}}-{\mathbf{w}}^{{\mathsf{T}}}X^{{\mathsf{T}}}{{\mathbf{y}}}{}+{{\mathbf{y}}}{}^{{\mathsf{T}}}{{\mathbf{y}}}{}) \\ & = & \nabla_{\mathbf{w}}({\mathbf{w}}^{{\mathsf{T}}}X^{{\mathsf{T}}}X{\mathbf{w}}-2{{\mathbf{y}}}^{{\mathsf{T}}}X{\mathbf{w}}+{{\mathbf{y}}}{}^{{\mathsf{T}}}{{\mathbf{y}}}{}) \\ & = & 2 X^{{\mathsf{T}}}X {\mathbf{w}}- 2 X^{{\mathsf{T}}}{{\mathbf{y}}} \end{aligned}\]

  • Setting gradient equal to zero: \[\begin{aligned} 2 X^{{\mathsf{T}}}X {\mathbf{w}}- 2 X^{{\mathsf{T}}}{{\mathbf{y}}}{} & = & 0 \\ \Rightarrow X^{{\mathsf{T}}}X {\mathbf{w}}& = & X^{{\mathsf{T}}}{{\mathbf{y}}}{} \\ \Rightarrow {\mathbf{w}}= (X^{{\mathsf{T}}}X)^{-1}X^{{\mathsf{T}}}{{\mathbf{y}}}{}\end{aligned}\]

  • The inverse exists if the columns of \(X\) are linearly independent.
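
As a sketch, the closed form can be evaluated directly in R, using the data from the example that follows:

# Data matrix with a column of ones, and target vector (from the next slide)
X <- cbind(1, c(0.86, 0.09, -0.85, 0.87, -0.44, -0.43, -1.10, 0.40, -0.96, 0.17))
y <- c(2.49, 0.83, -0.25, 3.10, 0.87, 0.02, -0.12, 1.81, -0.83, 0.43)
w <- solve(t(X) %*% X, t(X) %*% y)  # solve the normal equations (avoids an explicit inverse)
w                                   # ~ (1.06, 1.61), matching the worked example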

Example of linear regression

x0 x1 y
1 0.86 2.49
1 0.09 0.83
1 -0.85 -0.25
1 0.87 3.10
1 -0.44 0.87
1 -0.43 0.02
1 -1.10 -0.12
1 0.40 1.81
1 -0.96 -0.83
1 0.17 0.43

\(h_{\bf w} ({\bf x}) = 1.06 + 1.61 x_1\)

Data matrices

\[X=\left[\begin{array}{rr} 1 & 0.86 \\ 1 & 0.09 \\ 1 & -0.85 \\ 1 & 0.87 \\ 1 & -0.44 \\ 1 & -0.43 \\ 1 & -1.10 \\ 1 & 0.40 \\ 1 & -0.96 \\ 1 & 0.17 \end{array}\right]~~~~~{{\mathbf{y}}}{}=\left[\begin{array}{r} 2.49 \\ 0.83 \\ -0.25 \\ 3.10 \\ 0.87 \\ 0.02 \\ -0.12 \\ 1.81 \\ -0.83 \\ 0.43 \end{array}\right]\]

\(X^{{\mathsf{T}}}X\)

\[X^{{\mathsf{T}}}X =\] \[{\tiny \left[\begin{array}{rrrrrrrrrr} 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 0.86 & 0.09 & -0.85 & 0.87 & -0.44 & -0.43 & -1.10 & 0.40 & -0.96 & 0.17 \end{array}\right]\times\left[\begin{array}{cc} 1 & 0.86 \\ 1 & 0.09 \\ 1 & -0.85 \\ 1 & 0.87 \\ 1 & -0.44 \\ 1 & -0.43 \\ 1 & -1.10 \\ 1 & 0.40 \\ 1 & -0.96 \\ 1 & 0.17 \end{array}\right]}\] \[=\left[\begin{array}{rr} 10 & -1.39 \\ -1.39 & 4.95 \end{array}\right]\]

\(X^{{\mathsf{T}}}{{\mathbf{y}}}{}\)

\[X^{{\mathsf{T}}}{{\mathbf{y}}}{}=\] \[{\tiny \left[\begin{array}{rrrrrrrrrr} 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 0.86 & 0.09 & -0.85 & 0.87 & -0.44 & -0.43 & -1.10 & 0.40 & -0.96 & 0.17 \end{array}\right]\times\left[\begin{array}{c} 2.49 \\ 0.83 \\ -0.25 \\ 3.10 \\ 0.87 \\ 0.02 \\ -0.12 \\ 1.81 \\ -0.83 \\ 0.43 \end{array}\right]}\] \[=\left[\begin{array}{r} 8.34 \\ 6.49 \end{array}\right]\]

Solving for \({\bf w}\)

\[{\bf w}=(X^{{\mathsf{T}}}X)^{-1}X^{{\mathsf{T}}}{{\mathbf{y}}}{} = \left[\begin{array}{rr} 10 & -1.39 \\ -1.39 & 4.95 \end{array}\right]^{-1}\left[\begin{array}{r} 8.34 \\ 6.49 \end{array}\right] = \left[\begin{array}{r} 1.06 \\ 1.61 \end{array}\right]\]

So the best fit line is \(y=1.06 + 1.61x\).

Linear regression summary

  • The optimal solution (minimizing sum-squared-error) can be computed in polynomial time in the size of the data set.

  • The solution is \({\bf w}=(X^{{\mathsf{T}}}X)^{-1}X^{{\mathsf{T}}}{{\mathbf{y}}}{}\), where \(X\) is the data matrix augmented with a column of ones, and \({{\mathbf{y}}}{}\) is the column vector of target outputs.

  • A very rare case in which an analytical, exact solution is possible

Is linear regression enough?

  • Linear regression should be the first thing you try for real-valued outputs!

  • …but it is sometimes not expressive enough.

  • Two possible solutions:

    1. Explicitly transform the data, i.e. create additional features

      • Add cross-terms, higher-order terms

      • More generally, apply a transformation of the inputs from \({\cal X}\) to some other space \({\cal X}'\), then do linear regression in the transformed space

    2. Use a different model space/hypothesis class

  • Ideas (1) and (2) are two views of the same strategy. Today we focus on the first approach.

Polynomial fits (HTF 2.6, JWHT 7.1)

  • Suppose we want to fit a higher-degree polynomial to the data.
    (E.g., \(y=w_0 + w_1x_1+w_2 x_1^2\).)

  • Suppose for now that there is a single input variable \(x_{1,i}\) per training sample.

  • How do we do it?

Answer: Polynomial regression

  • Given data: \((x_{1,1},y_1), (x_{1,2},y_2), \ldots, (x_{1,n},y_n)\).

  • Suppose we want a degree-\(d\) polynomial fit.

  • Let \({{\mathbf{y}}}{}\) be as before and let \[X=\left[\begin{array}{rrrrr} 1 & x_{1,1} & x_{1,1}^2 & \ldots & x_{1,1}^d \\ 1 & x_{1,2} & x_{1,2}^2 & \ldots & x_{1,2}^d \\ \vdots & & \vdots & \vdots & \vdots \\ 1 & x_{1,n} & x_{1,n}^2 & \ldots & x_{1,n}^d \\ \end{array}\right]\]

  • We are making up features to add to our design matrix

  • Solve the linear regression \(X{\bf w}\approx {{\mathbf{y}}}{}\).
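
In R there is no need to build \(X\) by hand; a sketch (assuming the example data frame exb with columns x1 and y from earlier):

mod2 <- lm(y ~ poly(x1, 2, raw = TRUE), data = exb)  # degree-2 polynomial fit
# equivalently: lm(y ~ x1 + I(x1^2), data = exb), as in the fits shown below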

Example of quadratic regression: Data matrices

\[X=\left[\begin{array}{rrr} 1 & 0.86 & 0.75 \\ 1 & 0.09 & 0.01 \\ 1 & -0.85 & 0.73 \\ 1 & 0.87 & 0.76 \\ 1 & -0.44 & 0.19 \\ 1 & -0.43 & 0.18 \\ 1 & -1.10 & 1.22 \\ 1 & 0.40 & 0.16 \\ 1 & -0.96 & 0.93 \\ 1 & 0.17 & 0.03 \end{array}\right]~~~~~{{\mathbf{y}}}{}=\left[\begin{array}{r} 2.49 \\ 0.83 \\ -0.25 \\ 3.10 \\ 0.87 \\ 0.02 \\ -0.12 \\ 1.81 \\ -0.83 \\ 0.43 \end{array}\right]\]

\(X^{{\mathsf{T}}}X\)

\[X^{{\mathsf{T}}}X =\] \[{\tiny \hspace{-0.3in}\left[\begin{array}{rrrrrrrrrr} 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 0.86 & 0.09 & -0.85 & 0.87 & -0.44 & -0.43 & -1.10 & 0.40 & -0.96 & 0.17 \\ 0.75 & 0.01 & 0.73 & 0.76 & 0.19 & 0.18 & 1.22 & 0.16 & 0.93 & 0.03 \end{array}\right]\times\left[\begin{array}{rrr} 1 & 0.86 & 0.75 \\ 1 & 0.09 & 0.01 \\ 1 & -0.85 & 0.73 \\ 1 & 0.87 & 0.76 \\ 1 & -0.44 & 0.19 \\ 1 & -0.43 & 0.18 \\ 1 & -1.10 & 1.22 \\ 1 & 0.40 & 0.16 \\ 1 & -0.96 & 0.93 \\ 1 & 0.17 & 0.03 \end{array}\right]}\] \[=\left[\begin{array}{rrr} 10 & -1.39 & 4.95 \\ -1.39 & 4.95 & -1.64 \\ 4.95 & -1.64 & 4.11 \end{array}\right]\]

\(X^{{\mathsf{T}}}{{\mathbf{y}}}{}\)

\[X^{{\mathsf{T}}}{{\mathbf{y}}}{}=\] \[{\tiny \hspace{-0.3in}\left[\begin{array}{rrrrrrrrrr} 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 0.86 & 0.09 & -0.85 & 0.87 & -0.44 & -0.43 & -1.10 & 0.40 & -0.96 & 0.17 \\ 0.75 & 0.01 & 0.73 & 0.76 & 0.19 & 0.18 & 1.22 & 0.16 & 0.93 & 0.03\\ \end{array}\right]\times\left[\begin{array}{r} 2.49 \\ 0.83 \\ -0.25 \\ 3.10 \\ 0.87 \\ 0.02 \\ -0.12 \\ 1.81 \\ -0.83 \\ 0.43 \end{array}\right]}\] \[=\left[\begin{array}{r} 8.34 \\ 6.49 \\ 3.60 \end{array}\right]\]

Solving for \({\bf w}\)

\[{\bf w}=(X^{{\mathsf{T}}}X)^{-1}X^{{\mathsf{T}}}{{\mathbf{y}}}{} = {\tiny \left[\begin{array}{rrr} 10 & -1.39 & 4.95 \\ -1.39 & 4.95 & -1.64 \\ 4.95 & -1.64 & 4.11 \\ \end{array}\right]^{-1}\left[\begin{array}{r} 8.34 \\ 6.49 \\ 3.60 \end{array}\right] = \left[\begin{array}{r} 0.74 \\ 1.75 \\ 0.69 \end{array}\right]}\]

So the best order-2 polynomial is \(y=0.74 + 1.75 x + 0.69x^2\).

Data and linear fit

## (Intercept)           x 
##         1.1         1.6

Data and quadratic fit

## (Intercept)           x      I(x^2) 
##        0.74        1.75        0.69

Is this a better fit to the data?

Order-3 fit

## (Intercept)           x      I(x^2)      I(x^3) 
##        0.71        1.39        0.80        0.46

Is this a better fit to the data?

Order-4 fit

## (Intercept)           x      I(x^2)      I(x^3)      I(x^4) 
##       0.795       1.128      -0.039       0.905       0.898

Is this a better fit to the data?

Order-5 fit

## (Intercept)           x      I(x^2)      I(x^3)      I(x^4)      I(x^5) 
##        0.47        0.62        4.86        6.75       -5.25       -6.72

Is this a better fit to the data?

Order-6 fit

## (Intercept)           x      I(x^2)      I(x^3)      I(x^4)      I(x^5)      I(x^6) 
##        0.13        3.13        8.99      -11.11      -23.83       12.52       18.38

Is this a better fit to the data?

Order-7 fit

## (Intercept)           x      I(x^2)      I(x^3)      I(x^4)      I(x^5)      I(x^6)      I(x^7) 
##       0.096       3.207      10.193     -11.078     -30.742       8.263      25.527       5.483

Is this a better fit to the data?

Order-8 fit

## (Intercept)           x      I(x^2)      I(x^3)      I(x^4)      I(x^5)      I(x^6)      I(x^7)      I(x^8) 
##         1.3        -5.9        -5.1        69.9        48.8      -172.0      -131.9       123.3       101.2

Is this a better fit to the data?

Order-9 fit

## (Intercept)           x      I(x^2)      I(x^3)      I(x^4)      I(x^5)      I(x^6)      I(x^7)      I(x^8)      I(x^9) 
##        -1.1        34.8      -127.9      -379.9      1186.9      1604.8     -2475.4     -2627.6      1499.6      1448.1

Is this a better fit to the data?

Evaluating Performance

Which do you prefer and why?

Performance of a Fixed Hypothesis

(HTF 7.1–7.4, JWHT 2.2, 5)
  • Assume data \(({{{\mathbf{x}}}},y)\) are drawn from some fixed distribution

  • Given a model \(h\) (which could have come from anywhere), its generalization error is: \[J^*_h = E[L(h({{\bf X}}),Y)]\] where \(L\) is a loss function (e.g., squared error)

  • Given a test set of data, we can compute the test error \[\hat{J}^*_h = \frac{1}{n}\sum_{i=1}^n L(h({{{\mathbf{x}}}}_i),y_i)\]

  • \(\hat{J}^*_h\) is an unbiased estimate of \(J^*_h\) so long as the test data do not influence the choice of \(h\).

  • Can use \(\hat{J}^*_h\) with CLT or bootstrap to get a confidence interval for \(J^*_h\).
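
For instance, a CLT-based interval from the per-example test losses (a sketch; the function and names are mine):

# 95% CI for generalization error from per-example losses L(h(x_i), y_i)
ci95 <- function(losses) {
  m  <- mean(losses)
  se <- sd(losses) / sqrt(length(losses))
  c(lower = m - 1.96 * se, upper = m + 1.96 * se)
}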

Test Error: The Gold Standard

\[\hat{J}^*_h = \frac{1}{n}\sum_{i=1}^n L(h({{{\mathbf{x}}}}_i),y_i)\]

  • \(\hat{J}^*_h\) is an unbiased estimate of \(J^*_h\) so long as the \(({{{\mathbf{x}}}}_i,y_i)\) do not influence \(h\). Can use \(\hat{J}^*_h\) to get a confidence interval for \(J^*_h\).

  • Gives a strong statistical guarantee about the true performance of our system, if we didn’t use the test data to choose \(h\).

  • Note we can write training error for model class \(\mathcal{H}\) as

\[\hat{J}_\mathcal{H} = \min_{h' \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n L(h'({{{\mathbf{x}}}}_i),y_i)\]

  • Obviously, for any data set, \(\hat{J}_\mathcal{H} \le \hat{J}^*_h\) (when both are computed on that same data set and \(h \in \mathcal{H}\)).

Model Selection and Performance

  1. We would like to estimate the generalization error of our resulting predictor.

  2. We would like to choose the best model space (e.g. linear, quadratic, …)

Problem 1: Estimating Generalization Error

Training error \(\hat{J}_\mathcal{H}\) systematically underestimates generalization error \(J^*_h\)

Problem 2: Model Selection

The more complex the model, the smaller the training error.

  • Training error of the degree-9 polynomial is 0.

  • Training error of the degree-9 polynomial on any set of 10 points (with distinct \(x\)-values) is 0.

Problem 2: Model Selection

Smaller training error does not mean smaller generalization error.

  • Suppose \(\mathcal{H}_1\) is the space of all linear functions, \(\mathcal{H}_2\) is the space of all quadratic functions. Note \(\mathcal{H}_1 \subset \mathcal{H}_2\).

  • Fix a training set.

  • Let \(h_1 = \arg\min_{h' \in \mathcal{H}_1} \hat{J}_{h'}\) and \(h_2 = \arg\min_{h' \in \mathcal{H}_2} \hat{J}_{h'}\), where \(\hat{J}_{h'}\) denotes training error on this set

  • We must have \(\hat{J}_{h_2} \le \hat{J}_{h_1}\), but we may have \(J^*_{h_2} > J^*_{h_1}\).

Problem 2: Model Selection

Training error is no good for choosing the model space.

Fixing Problem 1: Generalization Error

Training error \(\hat{J}_\mathcal{H}\) underestimates generalization error \(J^*_h\)

  • If you really want a good estimate of \(J^*_h\), you need a separate test set

  • (But new stat methods can produce a CI using training error)

  • Could report test error, then deploy whatever you train on the whole data. (Probably won’t be worse.)

Fixing Problem 2: Model Selection

Smaller training error does not mean smaller generalization error.

  • Small training error, large generalization error is known as overfitting

  • A separate validation set can be used for model selection.
    • Train on the training set using each proposed model space
    • Evaluate each on the validation set; choose the one with the lowest validation error

Training, Model Selection, and Error Estimation

  • A general procedure for estimating the true error of a specific learned model using model selection

  • The data is randomly partitioned into three disjoint subsets:

    • A training set used only to find the parameters \({\bf w}\)

    • A validation set used to find the right model space (e.g., the degree of the polynomial)

    • A test set used to report the prediction error of the algorithm

  • Can generate standard confidence intervals for the test error of the learned model

Problems with the Single-Partition Approach

  • Pros:

    • Measures what we want: Performance of the actual learned model.
  • Cons:

    • Smaller effective training sets make performance more variable.

    • Small validation sets can give poor model selection

    • Small test sets can give poor estimates of performance

    • For a test set of size 100, with 60 correct classifications, 95% C.I. for actual accuracy is \((0.497,0.698)\).
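
One way to compute such an interval (an exact binomial CI; the slide's exact method may differ slightly):

binom.test(60, 100)$conf.int  # ~ (0.497, 0.697)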

k-fold cross-validation
(HTF 7.10, JWHT 5.1)

  • Divide the instances into \(k\) disjoint partitions or folds of size \(n/k\)

  • Loop through the partitions \(i = 1 ... k\):

    • Partition \(i\) is for evaluation (i.e., estimating the performance of the algorithm after learning is done)

    • The rest are used for training (i.e., choosing the specific model within the space)

  • "Cross-Validation Error" is the average error on the evaluation partitions. Has lower variance than error on one partition.

  • This is the main idea of CV, though it is put to several different purposes.
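
A minimal sketch of the basic procedure (my own code, applied to the faithful data; not from the slides):

k <- 10
folds <- sample(rep(1:k, length.out = nrow(faithful)))  # random disjoint partitions
cv_err <- sapply(1:k, function(i) {
  fit <- lm(waiting ~ eruptions, data = faithful[folds != i, ])  # train on the rest
  mean((predict(fit, faithful[folds == i, ]) -
        faithful$waiting[folds == i])^2)                         # evaluate on fold i
})
mean(cv_err)  # cross-validation error: average over the k evaluation partitions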

Misuse of CV (HTF 7.10.2)

  • \(n=50\) examples, binary classification, balanced classes

  • \(p = 5000\) features, all statistically independent of \(y\)

  • Use model selection to find the best \(100\) features by correlation with the labels, on the entire dataset.

  • Use cross-validation with these \(p = 100\) features to estimate error.

  • The CV-based error rate was 3%, even though the true error rate of any classifier here is 50%: selecting features on the full dataset leaks label information into the held-out folds.

k-fold cross-validation model selection
(HTF 7.10, JWHT 5.1)

  • Divide the instances into \(k\) folds of size \(n/k\).

  • Loop over \(m\) model spaces \(1 ... m\)

    • Loop over the \(k\) folds \(i = 1 ... k\):

      • Fold \(i\) is for validation (i.e., estimating the performance of the algorithm after learning is done)

      • The rest are used for training (i.e., choosing the specific model within the space)

  • For each model space, report average error over folds, and standard error.

CV for Model Selection

  • Typically select "most parsimonious model whose error is no more than one standard error above the error of the best model." (HTF)

Estimating "which is best" vs. "performance of best"

Nested CV for Model Evaluation

  • Divide the instances into \(k\) "outer" folds of size \(n/k\).

  • Loop over the \(k\) outer folds \(i = 1 ... k\):
    • Fold \(i\) is for testing; all others for training.
    • Divide the training instances into \(k'\) "inner" folds of size \((n - n/k)/k'\).
    • Loop over \(m\) model spaces \(1 ... m\)

      • Loop over the \(k'\) inner folds \(j = 1 ... k'\):
        • Fold \(j\) is for validation
        • The rest are used for training
      • Use average error over folds and SE to choose model space.

      • Train on all inner folds.

    • Test the model on outer test fold
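
A structural sketch of the whole procedure (my own code under assumed names, with polynomial degree playing the role of the model space; selection here uses the minimum inner-CV error rather than the one-standard-error rule):

nested_cv <- function(df, degrees = 1:5, k = 5, k_inner = 5) {
  outer <- sample(rep(1:k, length.out = nrow(df)))
  outer_err <- sapply(1:k, function(i) {
    train <- df[outer != i, ]                        # outer fold i held out for testing
    inner <- sample(rep(1:k_inner, length.out = nrow(train)))
    val_err <- sapply(degrees, function(d)           # inner CV over model spaces
      mean(sapply(1:k_inner, function(j) {
        fit <- lm(y ~ poly(x, d, raw = TRUE), data = train[inner != j, ])
        mean((predict(fit, train[inner == j, ]) - train$y[inner == j])^2)
      })))
    best <- degrees[which.min(val_err)]              # choose model space
    fit <- lm(y ~ poly(x, best, raw = TRUE), data = train)  # retrain on all inner folds
    mean((predict(fit, df[outer == i, ]) - df$y[outer == i])^2)  # outer test fold
  })
  mean(outer_err)  # estimate of the error of the full train-and-select pipeline
}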

Nested CV for Model Evaluation

Generalization Error for degree 3 model

Minimum-CV Estimate: 128.48, Nested CV Estimate: 149.91

Bias-correction for the CV Procedure

Summary

  • The training error decreases with the degree of the polynomial \(M\), i.e. the complexity (size) of the model space

  • Generalization error decreases at first, then starts increasing

  • Setting aside a validation set helps us find a good model space

  • We can then report an unbiased error estimate, using a test set untouched during both parameter training and validation

  • Cross-validation is a lower-variance but possibly biased version of this approach. It is standard.

  • If you have lots of data, just use held-out validation and test sets.

How much data do I need?

  • For training, as much as possible; the more complex your model space, the more data you need. (More on this later.)

  • For testing using a held-out test set, we can do a sample size calculation.

  • Required: Estimate of the true performance you think you might achieve.

library(pwr)
# ES.h computes the "effect size" -- in this case, the difference between 50% and 60% accuracy
# power is the probability of detecting a true effect; sig.level controls Type I error
pwr.p.test(ES.h(0.6,0.5),power=0.80,sig.level=0.05,alternative="greater")
## 
##      proportion power calculation for binomial distribution (arcsine transformation) 
## 
##               h = 0.2
##               n = 152
##       sig.level = 0.05
##           power = 0.8
##     alternative = greater