- Random vectors or vector-valued random variables.
- Variables that occur together in some meaningful sense.
2017-03-05
library(knitr); kable(head(faithful,10))
eruptions | waiting |
---|---|
3.600 | 79 |
1.800 | 54 |
3.333 | 74 |
2.283 | 62 |
4.533 | 85 |
2.883 | 55 |
4.700 | 88 |
3.600 | 85 |
1.950 | 51 |
4.350 | 85 |
\[ \rho_{X,Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X\sigma_Y} \]
\[ r_{X,Y} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2}\sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}} \]
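In R, the sample correlation matrix for the faithful data can be computed directly with base R's cor(); the matrix below is the output of a call like this:

```r
# Pearson correlation between eruption duration and waiting time
cor(faithful)
```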
##           eruptions   waiting
## eruptions 1.0000000 0.9008112
## waiting   0.9008112 1.0000000
\[ F_{X,Y}(x,y) = F_X(x)F_Y(y) \]
\[ \Pr(X=x|Y=y) = \Pr(X=x) \]
\[ \Pr(Y=y|X=x) = \Pr(Y=y) \]
##           x         y
## x 1.0000000 0.7060551
## y 0.7060551 1.0000000
##            x          y
## x 1.00000000 0.01061515
## y 0.01061515 1.00000000
##             x           y
## x  1.00000000 -0.00227553
## y -0.00227553  1.00000000
##             x           y
## x 1.000000000 0.004527559
## y 0.004527559 1.000000000
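The correlation matrices above presumably come from simulated pairs of variables; a minimal sketch of generating one dependent and one independent pair (the seed, sample size, and distributions are assumptions, not the original code):

```r
set.seed(1)
n <- 1000
x <- rnorm(n)
y_dep   <- x + rnorm(n)   # y depends on x: correlation well away from 0
y_indep <- rnorm(n)       # y independent of x: correlation near 0
cor(cbind(x = x, y = y_dep))
cor(cbind(x = x, y = y_indep))
```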
## Mean: 70.90
## Mean: 55.60
## Mean: 81.33
Training experience: a set of labeled examples of the form
\[\langle x_1,\,x_2,\,\dots x_p,y\rangle,\]
where \(x_j\) are feature values and \(y\) is the output
What to learn: A function \(f:\mathcal{X}_1 \times \mathcal{X}_2 \times \cdots \times \mathcal{X}_p \rightarrow \mathcal{Y}\), which maps the features into the output domain
Cell samples were taken from tumors of breast cancer patients before surgery and imaged; the tumors were then excised; patients were followed to determine whether the cancer recurred and, if so, how long until recurrence (otherwise, how long they remained disease-free).
Radius.Mean | Texture.Mean | Perimeter.Mean | … | Outcome | Time |
---|---|---|---|---|---|
18.02 | 27.60 | 117.50 | … | N | 31 |
17.99 | 10.38 | 122.80 | … | N | 61 |
21.37 | 17.44 | 137.50 | … | N | 116 |
11.42 | 20.38 | 77.58 | … | N | 123 |
20.29 | 14.34 | 135.10 | … | R | 27 |
12.75 | 15.29 | 84.60 | … | R | 77 |
… | … | … | … | … | … |
The \(i\)th training example has the form: \(\langle x_{1,i}, \dots x_{p,i}, y_i\rangle\) where \(p\) is the number of features (32 in our case).
Notation \({\bf x}_i\) denotes a column vector with elements \(x_{1,i},\dots x_{p,i}\).
The training set \(D\) consists of \(n\) training examples
We denote the \(n\times p\) matrix of features by \(X\) and the size-\(n\) column vector of outputs from the data set by \({\mathbf{y}}\).
In statistics, \(X\) is called the data matrix or the design matrix.
\({{\cal X}}\) denotes the space of input values
\({{\cal Y}}\) denotes the space of output values
Given a data set \(D \subset ({{\cal X}}\times {{\cal Y}})^n\), find a function: \[h : {{\cal X}}\rightarrow {{\cal Y}}\] such that \(h({\bf x})\) is a “good predictor” for the value of \(y\).
\(h\) is called a predictive model or hypothesis
Problems are categorized by the type of output domain
If \({{\cal Y}}=\mathbb{R}\), this problem is called regression
If \({{\cal Y}}\) is a finite discrete set, the problem is called classification
If \({{\cal Y}}\) has 2 elements, the problem is called binary classification
Decide what the input-output pairs are.
Decide how to encode inputs and outputs.
This defines the input space \({{\cal X}}\), and the output space \({{\cal Y}}\).
(We will discuss this in detail later)
Choose model space/hypothesis class \({{\cal H}}\) .
…
Suppose \(y\) is a linear function of \({\bf x}\): \[h_{\bf w}({\bf x}) = w_0 + w_1 x_1 + w_2 x_2 + \cdots\]
\(w_i\) are called parameters or weights (often \(\beta_i\) in stats books)
Typically include an attribute \(x_0=1\) (also called bias term or intercept term) so that the number of weights is \(p+1\). We then write: \[h_{\bf w}({\bf x}) = \sum_{i=0}^p w_i x_i = {\mathbf{x}}^{\mathsf{T}}{\mathbf{w}}\] where \({\bf w}\) and \({\bf x}\) are column vectors of size \(p+1\).
The design matrix \(X\) is now \(n\) by \(p+1\).
Models will be of the form \[ \begin{align} h_{\mathbf{w}}({\mathbf{x}}) & = x_0 w_0 + x_1 w_1\\ & = w_0 + x_1 w_1 \end{align} \] How should we pick \({\mathbf{w}}\)?
Intuitively, \(\bf w\) should make the predictions of \(h_{\bf w}\) close to the true values \(y_i\) on the training data
Define an error function or cost function to measure how much our prediction differs from the “true” answer on the training data
Pick \(\bf w\) such that the error function is minimized
Hopefully, new examples are somehow “similar” to the training examples, and will also have small error.
How should we choose the error function?
Main idea: try to make \(h_{\bf w}({\bf x})\) close to \(y\) on the examples in the training set
We define a sum-of-squares error function \[J({\bf w}) = \frac{1}{2}\sum_{i=1}^n (h_{\bf w}({\bf x}_i)-y_i)^2\] (the \(1/2\) is just for convenience)
We will choose \(\bf w\) such as to minimize \(J(\bf w)\)
One way to do it: compute \(\bf w\) such that: \[\frac{\partial}{\partial w_j}J({\bf w}) = 0,\,\, \forall j=0\dots p\]
## SSE: 21.510
mod <- lm(y ~ x1, data=exb); print(mod$coefficients)
## (Intercept)          x1 
##    1.058813    1.610168
## SSE: 2.240
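An SSE like the ones above can be computed from any model's predictions; a minimal sketch for the fitted model (the data frame exb and its column names follow the lm() call above):

```r
# Sum-of-squares error of the fitted line on the training data
# (the 1/2 factor in J(w) is a constant and is omitted here)
sse <- sum((predict(mod, exb) - exb$y)^2)
cat(sprintf("SSE: %.3f\n", sse))
```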
Decide what the input-output pairs are.
Decide how to encode inputs and outputs.
This defines the input space \({{\cal X}}\), and the output space \({{\cal Y}}\).
Choose a class of models/hypotheses \({{\cal H}}\) .
Choose an error function (cost function) to define the best model in the class
Choose an algorithm for searching efficiently through the space of models to find the best.
mod <- lm(Time ~ Radius.Mean, data=bc %>% filter(Outcome == 'R')); print(mod$coefficients)
## (Intercept) Radius.Mean 
##   83.161238   -3.156896
Consider a function \(J(u_1,u_2,\ldots,u_p):\mathbb{R}^p\mapsto\mathbb{R}\) (for us, this will usually be an error function)
The gradient \(\nabla J(u_1,u_2,\ldots,u_p):\mathbb{R}^p\mapsto\mathbb{R}^p\) is a function which outputs a vector containing the partial derivatives.
That is: \[\nabla J = \left\langle{\frac{\partial}{\partial u_1}}J,{\frac{\partial}{\partial u_2}}J,\ldots,{\frac{\partial}{\partial u_p}}J\right\rangle\]
If \(J\) is differentiable and convex, we can find the global minimum of \(J\) by solving \(\nabla J = \mathbf{0}\).
The partial derivative is the derivative along the \(u_i\) axis, keeping all other variables fixed.
Recalling some multivariate calculus: \[\begin{aligned} \nabla_{\mathbf{w}}J & = & \nabla_{\mathbf{w}}(X{\mathbf{w}}-{{\mathbf{y}}})^{{\mathsf{T}}}(X{\mathbf{w}}-{{\mathbf{y}}}{}) \\ & = & \nabla_{\mathbf{w}}({\mathbf{w}}^{\mathsf{T}}X^{\mathsf{T}}-{{\mathbf{y}}^{\mathsf{T}}})(X{\mathbf{w}}-{{\mathbf{y}}}{}) \\ & = & \nabla_{\mathbf{w}}({\mathbf{w}}^{{\mathsf{T}}}X^{{\mathsf{T}}}X{\mathbf{w}}-{{\mathbf{y}}}^{{\mathsf{T}}}X{\mathbf{w}}-{\mathbf{w}}^{{\mathsf{T}}}X^{{\mathsf{T}}}{{\mathbf{y}}}{}+{{\mathbf{y}}}{}^{{\mathsf{T}}}{{\mathbf{y}}}{}) \\ & = & \nabla_{\mathbf{w}}({\mathbf{w}}^{{\mathsf{T}}}X^{{\mathsf{T}}}X{\mathbf{w}}-2{{\mathbf{y}}}^{{\mathsf{T}}}X{\mathbf{w}}+{{\mathbf{y}}}{}^{{\mathsf{T}}}{{\mathbf{y}}}{}) \\ & = & 2 X^{{\mathsf{T}}}X {\mathbf{w}}- 2 X^{{\mathsf{T}}}{{\mathbf{y}}} \end{aligned}\]
Setting gradient equal to zero: \[\begin{aligned} 2 X^{{\mathsf{T}}}X {\mathbf{w}}- 2 X^{{\mathsf{T}}}{{\mathbf{y}}}{} & = & 0 \\ \Rightarrow X^{{\mathsf{T}}}X {\mathbf{w}}& = & X^{{\mathsf{T}}}{{\mathbf{y}}}{} \\ \Rightarrow {\mathbf{w}}= (X^{{\mathsf{T}}}X)^{-1}X^{{\mathsf{T}}}{{\mathbf{y}}}{}\end{aligned}\]
The inverse exists if the columns of \(X\) are linearly independent.
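As a sanity check, the closed-form solution can be computed directly in R and compared against lm(); a minimal sketch, reusing the exb data frame from the earlier example (its column names are the only assumption):

```r
# Normal equations: solve (X^T X) w = X^T y rather than forming the inverse
X <- cbind(1, exb$x1)                  # design matrix with a column of ones
y <- exb$y
w <- solve(t(X) %*% X, t(X) %*% y)
w                                       # should match coef(lm(y ~ x1, data = exb))
```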
\(h_{\bf w} ({\bf x}) = 1.06 + 1.61 x_1\)
\[X=\left[\begin{array}{rr} 1 & 0.86 \\ 1 & 0.09 \\ 1 & -0.85 \\ 1 & 0.87 \\ 1 & -0.44 \\ 1 & -0.43 \\ 1 & -1.10 \\ 1 & 0.40 \\ 1 & -0.96 \\ 1 & 0.17 \end{array}\right]~~~~~{{\mathbf{y}}}{}=\left[\begin{array}{r} 2.49 \\ 0.83 \\ -0.25 \\ 3.10 \\ 0.87 \\ 0.02 \\ -0.12 \\ 1.81 \\ -0.83 \\ 0.43 \end{array}\right]\]
\[X^{{\mathsf{T}}}X =\] \[{\tiny \left[\begin{array}{rrrrrrrrrr} 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 0.86 & 0.09 & -0.85 & 0.87 & -0.44 & -0.43 & -1.10 & 0.40 & -0.96 & 0.17 \end{array}\right]\times\left[\begin{array}{cc} 1 & 0.86 \\ 1 & 0.09 \\ 1 & -0.85 \\ 1 & 0.87 \\ 1 & -0.44 \\ 1 & -0.43 \\ 1 & -1.10 \\ 1 & 0.40 \\ 1 & -0.96 \\ 1 & 0.17 \end{array}\right]}\] \[=\left[\begin{array}{rr} 10 & -1.39 \\ -1.39 & 4.95 \end{array}\right]\]
\[X^{{\mathsf{T}}}{{\mathbf{y}}}{}=\] \[{\tiny \left[\begin{array}{rrrrrrrrrr} 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 0.86 & 0.09 & -0.85 & 0.87 & -0.44 & -0.43 & -1.10 & 0.40 & -0.96 & 0.17 \end{array}\right]\times\left[\begin{array}{c} 2.49 \\ 0.83 \\ -0.25 \\ 3.10 \\ 0.87 \\ 0.02 \\ -0.12 \\ 1.81 \\ -0.83 \\ 0.43 \end{array}\right]}\] \[=\left[\begin{array}{r} 8.34 \\ 6.49 \end{array}\right]\]
\[{\bf w}=(X^{{\mathsf{T}}}X)^{-1}X^{{\mathsf{T}}}{{\mathbf{y}}}{} = \left[\begin{array}{rr} 10 & -1.39 \\ -1.39 & 4.95 \end{array}\right]^{-1}\left[\begin{array}{r} 8.34 \\ 6.49 \end{array}\right] = \left[\begin{array}{r} 1.06 \\ 1.61 \end{array}\right]\]
So the best fit line is \(y=1.06 + 1.61x\).
The optimal solution (minimizing sum-squared-error) can be computed in polynomial time in the size of the data set.
The solution is \({\bf w}=(X^{{\mathsf{T}}}X)^{-1}X^{{\mathsf{T}}}{{\mathbf{y}}}{}\), where \(X\) is the data matrix augmented with a column of ones, and \({{\mathbf{y}}}{}\) is the column vector of target outputs.
A very rare case in which an analytical, exact solution is possible
Linear regression should be the first thing you try for real-valued outputs!
…but it is sometimes not expressive enough.
Two possible solutions:
Explicitly transform the data, i.e. create additional features
Add cross-terms, higher-order terms
More generally, apply a transformation of the inputs from \({\cal X}\) to some other space \({\cal X}'\), then do linear regression in the transformed space
Use a different model space/hypothesis class
Idea (1) and idea (2) are two views of the strategy. Today we focus on the first approach
Suppose we want to fit a higher-degree polynomial to the data.
(E.g., \(y=w_0 + w_1x_1+w_2 x_1^2\).)
Suppose for now that there is a single input variable \(x_{1,i}\) per training sample.
How do we do it?
Given data: \((x_{1,1},y_1), (x_{1,2},y_2), \ldots, (x_{1,n},y_n)\).
Suppose we want a degree-\(d\) polynomial fit.
Let \({{\mathbf{y}}}{}\) be as before and let \[X=\left[\begin{array}{rrrrr} 1 & x_{1,1} & x_{1,1}^2 & \ldots & x_{1,1}^d \\ 1 & x_{1,2} & x_{1,2}^2 & \ldots & x_{1,2}^d \\ \vdots & & \vdots & \vdots & \vdots \\ 1 & x_{1,n} & x_{1,n}^2 & \ldots & x_{1,n}^d \\ \end{array}\right]\]
We are making up features to add to our design matrix
Solve the linear regression \(X{\bf w}\approx {{\mathbf{y}}}{}\).
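In R this can be done either by adding the columns to the design matrix by hand or by using I() terms in the model formula; a minimal sketch for degree 2, again reusing the exb data from the earlier example:

```r
# Two equivalent degree-2 fits
mod2 <- lm(y ~ x1 + I(x1^2), data = exb)         # formula interface
X2   <- cbind(1, exb$x1, exb$x1^2)               # explicit design matrix
w2   <- solve(t(X2) %*% X2, t(X2) %*% exb$y)     # normal equations
coef(mod2); w2                                    # the two agree
```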
\[X=\left[\begin{array}{rrr} 1 & 0.86 & 0.75 \\ 1 & 0.09 & 0.01 \\ 1 & -0.85 & 0.73 \\ 1 & 0.87 & 0.76 \\ 1 & -0.44 & 0.19 \\ 1 & -0.43 & 0.18 \\ 1 & -1.10 & 1.22 \\ 1 & 0.40 & 0.16 \\ 1 & -0.96 & 0.93 \\ 1 & 0.17 & 0.03 \end{array}\right]~~~~~{{\mathbf{y}}}{}=\left[\begin{array}{r} 2.49 \\ 0.83 \\ -0.25 \\ 3.10 \\ 0.87 \\ 0.02 \\ -0.12 \\ 1.81 \\ -0.83 \\ 0.43 \end{array}\right]\]
\[X^{{\mathsf{T}}}X =\] \[{\tiny \hspace{-0.3in}\left[\begin{array}{rrrrrrrrrr} 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 0.86 & 0.09 & -0.85 & 0.87 & -0.44 & -0.43 & -1.10 & 0.40 & -0.96 & 0.17 \\ 0.75 & 0.01 & 0.73 & 0.76 & 0.19 & 0.18 & 1.22 & 0.16 & 0.93 & 0.03 \end{array}\right]\times\left[\begin{array}{rrr} 1 & 0.86 & 0.75 \\ 1 & 0.09 & 0.01 \\ 1 & -0.85 & 0.73 \\ 1 & 0.87 & 0.76 \\ 1 & -0.44 & 0.19 \\ 1 & -0.43 & 0.18 \\ 1 & -1.10 & 1.22 \\ 1 & 0.40 & 0.16 \\ 1 & -0.96 & 0.93 \\ 1 & 0.17 & 0.03 \end{array}\right]}\] \[=\left[\begin{array}{rrr} 10 & -1.39 & 4.95 \\ -1.39 & 4.95 & -1.64 \\ 4.95 & -1.64 & 4.11 \end{array}\right]\]
\[X^{{\mathsf{T}}}{{\mathbf{y}}}{}=\] \[{\tiny \hspace{-0.3in}\left[\begin{array}{rrrrrrrrrr} 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 0.86 & 0.09 & -0.85 & 0.87 & -0.44 & -0.43 & -1.10 & 0.40 & -0.96 & 0.17 \\ 0.75 & 0.01 & 0.73 & 0.76 & 0.19 & 0.18 & 1.22 & 0.16 & 0.93 & 0.03\\ \end{array}\right]\times\left[\begin{array}{r} 2.49 \\ 0.83 \\ -0.25 \\ 3.10 \\ 0.87 \\ 0.02 \\ -0.12 \\ 1.81 \\ -0.83 \\ 0.43 \end{array}\right]}\] \[=\left[\begin{array}{r} 8.34 \\ 6.49 \\ 3.60 \end{array}\right]\]
\[{\bf w}=(X^{{\mathsf{T}}}X)^{-1}X^{{\mathsf{T}}}{{\mathbf{y}}}{} = {\tiny \left[\begin{array}{rrr} 10 & -1.39 & 4.95 \\ -1.39 & 4.95 & -1.64 \\ 4.95 & -1.64 & 4.11 \\ \end{array}\right]^{-1}\left[\begin{array}{r} 8.34 \\ 6.49 \\ 3.60 \end{array}\right] = \left[\begin{array}{r} 0.74 \\ 1.75 \\ 0.69 \end{array}\right]}\]
So the best order-2 polynomial is \(y=0.74 + 1.75 x + 0.69x^2\).
## (Intercept)           x 
##         1.1         1.6
## (Intercept)           x      I(x^2) 
##        0.74        1.75        0.69
Is this a better fit to the data?
## (Intercept)           x      I(x^2)      I(x^3) 
##        0.71        1.39        0.80        0.46
Is this a better fit to the data?
## (Intercept)           x      I(x^2)      I(x^3)      I(x^4) 
##       0.795       1.128      -0.039       0.905       0.898
Is this a better fit to the data?
## (Intercept)           x      I(x^2)      I(x^3)      I(x^4)      I(x^5) 
##        0.47        0.62        4.86        6.75       -5.25       -6.72
Is this a better fit to the data?
## (Intercept)           x      I(x^2)      I(x^3)      I(x^4)      I(x^5)      I(x^6) 
##        0.13        3.13        8.99      -11.11      -23.83       12.52       18.38
Is this a better fit to the data?
## (Intercept)           x      I(x^2)      I(x^3)      I(x^4)      I(x^5)      I(x^6)      I(x^7) 
##       0.096       3.207      10.193     -11.078     -30.742       8.263      25.527       5.483
Is this a better fit to the data?
## (Intercept)           x      I(x^2)      I(x^3)      I(x^4)      I(x^5)      I(x^6)      I(x^7)      I(x^8) 
##         1.3        -5.9        -5.1        69.9        48.8      -172.0      -131.9       123.3       101.2
Is this a better fit to the data?
## (Intercept)           x      I(x^2)      I(x^3)      I(x^4)      I(x^5)      I(x^6)      I(x^7)      I(x^8)      I(x^9) 
##        -1.1        34.8      -127.9      -379.9      1186.9      1604.8     -2475.4     -2627.6      1499.6      1448.1
Is this a better fit to the data?
Which do you prefer and why?
Assume data \(({{{\mathbf{x}}}},y)\) are drawn from some fixed distribution
Given a model \(h\) (which could have come from anywhere), its generalization error is: \[J^*_h = E[L(h({{\bf X}}),Y)]\]
Given a set of data points from the same distribution, we can compute the empirical error \[\hat{J}^*_h = \frac{1}{n}\sum_{i=1}^n L(h({{{\mathbf{x}}}}_i),y_i)\]
\(\hat{J}^*_h\) is an unbiased estimate of \(J^*_h\) so long as the data did not influence the choice of \(h\).
Can use \(\hat{J}^*_h\) with CLT or bootstrap to get a C.I. for \(J^*_h\).
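For example, with squared-error loss and a held-out set that played no role in choosing \(h\), a CLT-based interval is a one-liner; a minimal sketch (the model object h and the holdout data frame are assumptions for illustration):

```r
# Per-example squared losses of a fixed model h on held-out data
losses <- (predict(h, holdout) - holdout$y)^2
# Approximate 95% confidence interval for the generalization error
mean(losses) + c(-1, 1) * qnorm(0.975) * sd(losses) / sqrt(length(losses))
```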
Gives a strong statistical guarantee about the true performance of our system, if we didn’t use the test data to choose \(h\).
We can write "training error" for model class \(\mathcal{H}\) on a given data set as
\[\hat{J}_\mathcal{H} = \min_{h' \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n L(h'({{{\mathbf{x}}}}_i),y_i)\]
Let the corresponding learned hypothesis be
\[h^* = \arg\min_{h' \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n L(h'({{{\mathbf{x}}}}_i),y_i)\]
We would like to estimate the generalization error of our resulting predictor.
We would like to choose the best model space (e.g. linear, quadratic, …)
Training error \(\hat{J}_\mathcal{H}\) systematically underestimates generalization error \(J^*_{h^*}\) for the learned hypothesis \(h^*\).
The more complex the model, the smaller the training error.
Smaller training error does not mean smaller generalization error.
Suppose \(\mathcal{H}_1\) is the space of all linear functions, \(\mathcal{H}_2\) is the space of all quadratic functions. Note \(\mathcal{H}_1 \subset \mathcal{H}_2\).
Fix a data set.
Let \(h^*_1 = \arg\min_{h' \in \mathcal{H}_1} \hat{J}^*_{h'}\) and \(h^*_2 = \arg\min_{h' \in \mathcal{H}_2} \hat{J}^*_{h'}\), both computed using the same dataset.
We must have \(\hat{J}^*_{h^*_2} \le \hat{J}^*_{h^*_1}\), but we may have \(J^*_{h^*_2} > J^*_{h^*_1}\).
Training error is no good for choosing the model space.
Training error \(\hat{J}_\mathcal{H}\) underestimates the generalization error \(J^*_{h^*}\) of the learned hypothesis \(h^*\)
If you really want a good estimate of \(J^*_h\), you need a separate test set
(But new stat methods can produce a CI using training error)
Could report test error, then deploy whatever you train on the whole data. (Probably won’t be worse.)
Smaller training error does not mean smaller generalization error.
Small training error, large generalization error is known as overfitting
A general procedure for performing model selection and estimating the true error of the resulting learned model
The data is randomly partitioned into three disjoint subsets:
A training set used only to find the parameters \({\bf w}\)
A validation set used to find the right model space (e.g., the degree of the polynomial)
A test set used to estimate the generalization error of the resulting model
Can generate standard confidence intervals for the generalization error of the learned model
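A minimal sketch of such a random three-way split (the 60/20/20 proportions and the data-frame name dat are assumptions):

```r
set.seed(1)
n   <- nrow(dat)
idx <- sample(n)                                          # random permutation of the rows
train <- dat[idx[1:floor(0.6 * n)], ]                     # fit the parameters w here
valid <- dat[idx[(floor(0.6 * n) + 1):floor(0.8 * n)], ]  # choose the model space here
test  <- dat[idx[(floor(0.8 * n) + 1):n], ]               # estimate generalization error here
```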
Pros:
Simple, and the held-out test set gives an unbiased error estimate with standard confidence intervals.
Cons:
Smaller effective training sets make performance more variable.
Small validation sets can give poor model selection
Small test sets can give poor estimates of performance
For a test set of size 100, with 60 correct classifications, 95% C.I. for actual accuracy is \((0.497,0.698)\).
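One way to obtain such an interval is the exact binomial test in base R (the quoted interval may have been computed with a slightly different method, so the endpoints can differ in the third decimal):

```r
# 95% confidence interval for accuracy: 60 correct out of 100 test examples
binom.test(60, 100)$conf.int
```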
Divide the instances into \(k\) disjoint partitions or folds of size \(n/k\)
Loop through the partitions \(i = 1 ... k\):
Partition \(i\) is for evaluation (i.e., estimating the performance of the algorithm after learning is done)
The rest are used for training (i.e., choosing the specific model within the space)
"Cross-Validation Error" is the average error on the evaluation partitions. Has lower variance than error on one partition.
This is the main CV idea; CV is used for different purposes though.
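A minimal k-fold sketch for the simple linear model from earlier (the fold count and the exb data frame are assumptions):

```r
set.seed(1)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(exb)))     # assign each row to a fold
cv_err <- sapply(1:k, function(i) {
  fit <- lm(y ~ x1, data = exb[folds != i, ])         # train on the other k-1 folds
  mean((predict(fit, exb[folds == i, ]) - exb$y[folds == i])^2)  # error on held-out fold i
})
mean(cv_err)                                          # cross-validation error
```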
\(n=50\) examples, binary classification, balanced classes
\(p = 5000\) features, all statistically independent of \(y\)
Use model selection to find the \(100\) features most correlated with \(y\), using the entire dataset.
Use cross-validation with these \(100\) features to estimate the error.
The CV-based error rate was 3%, even though the true error rate is 50% by construction: because the features were selected using all of the labels, information about the held-out folds leaked into the model.
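A sketch of how this goes wrong, with the sizes from the example (the classifier, fold count, and seed are assumptions; the point is only that screening features on all the data before cross-validating leaks label information):

```r
library(class)                                   # for knn(); shipped with R
set.seed(1)
n <- 50; p <- 5000
X <- matrix(rnorm(n * p), n, p)
y <- factor(rep(c(0, 1), length.out = n))        # labels independent of every feature

# The mistake: pick the 100 features most correlated with y using ALL the data
keep <- order(abs(cor(X, as.numeric(y))), decreasing = TRUE)[1:100]
Xs <- X[, keep]

# 5-fold CV of a 1-nearest-neighbour classifier on the pre-screened features
folds <- sample(rep(1:5, length.out = n))
err <- sapply(1:5, function(i) {
  pred <- knn(Xs[folds != i, ], Xs[folds == i, ], y[folds != i], k = 1)
  mean(pred != y[folds == i])
})
mean(err)   # far below the 50% error that independence of X and y implies
```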
Divide the instances into \(k\) folds of size \(n/k\).
Loop over \(m\) model spaces \(1 ... m\)
Loop over the \(k\) folds \(i = 1 ... k\):
Fold \(i\) is for validation (i.e., estimating the performance of the algorithm after learning is done)
The rest are used for training (i.e., choosing the specific model within the space)
For each model space, report average error over folds, and standard error.
Divide the instances into \(k\) "outer" folds of size \(n/k\).
Loop over \(m\) model spaces \(1 ... m\)
Use average error over folds and SE to choose model space.
Train on all inner folds.
Test the model on outer test fold
Minimum-CV Estimate: 128.48, Nested CV Estimate: 149.91
The training error decreases with the degree \(d\) of the polynomial, i.e., with the complexity (size) of the model space
Generalization error decreases at first, then starts increasing
Setting aside a validation set helps us find a good model space
We can then report an unbiased error estimate, using a test set untouched during both parameter training and validation
Cross-validation is a lower-variance but possibly biased version of this approach. It is standard.
If you have lots of data, just use held-out validation and test sets.