Performance: We would like to estimate the generalization error of our resulting predictor.
Model selection: We would like to choose the best model space (e.g. linear, quadratic, …) for the data we have
Choose model class
Find the model in the class that gives the minimum training error.
But we saw previously that generalization error is what we really want to minimize.
And picking the wrong model class can be catastrophic.
The best model space is not the simplest nor the most complex.
Larger model spaces never give higher training error (and usually give lower).
Suppose \(\mathcal{H}_1\) is the space of all linear functions, \(\mathcal{H}_2\) is the space of all quadratic functions. Note \(\mathcal{H}_1 \subset \mathcal{H}_2\).
Fix a data set.
Let \(h^*_1 = \arg\min_{h' \in \mathcal{H}_1} \frac{1}{n}\sum_{i=1}^n L(h'({{\mathbf{x}}}_i),y_i)\) and \(h^*_2 = \arg\min_{h' \in \mathcal{H}_2} \frac{1}{n}\sum_{i=1}^n L(h'({{\mathbf{x}}}_i),y_i)\), both computed using the same dataset.
It must be the case that \(\min_{h' \in \mathcal{H}_2} \frac{1}{n}\sum_{i=1}^n L(h'({{\mathbf{x}}}_i),y_i) \le \min_{h' \in \mathcal{H}_1} \frac{1}{n}\sum_{i=1}^n L(h'({{\mathbf{x}}}_i),y_i)\), since every \(h' \in \mathcal{H}_1\) is also a candidate in \(\mathcal{H}_2\).
Small training error combined with large generalization error is known as overfitting.
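As an illustration (not from the lecture), the sketch below fits polynomials of increasing degree to a small synthetic dataset; the target function, sample sizes, and noise level are all made-up placeholders. Training error keeps falling with degree, while error on fresh data eventually rises.

```python
# Hypothetical overfitting demo: train error vs. error on independent data.
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-1, 1, 30))
y = np.sin(3 * x) + rng.normal(scale=0.3, size=x.shape)          # noisy samples
x_new = np.sort(rng.uniform(-1, 1, 200))
y_new = np.sin(3 * x_new) + rng.normal(scale=0.3, size=x_new.shape)

for degree in [1, 2, 3, 5, 9]:
    coefs = np.polyfit(x, y, degree)                              # ERM within H_degree
    train_mse = np.mean((np.polyval(coefs, x) - y) ** 2)
    new_mse = np.mean((np.polyval(coefs, x_new) - y_new) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.3f}, held-out MSE {new_mse:.3f}")
```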
Experimental Scenario
Will this produce an unbiased estimate of generalization error?
A general procedure for doing model selection and performance evaluation
The data is randomly partitioned into three disjoint subsets:
A training set used only to find the parameters \({\bf w}\)
A validation set used to find the right model space (e.g., the degree of the polynomial)
A test set used to estimate the generalization error of the resulting model
Can generate standard confidence intervals for the generalization error of the learned model
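A minimal sketch of this three-way split, assuming scikit-learn; the data and split fractions are placeholders, not the lecture's example.

```python
# Carve off a test set first, then split the rest into training and validation.
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.random.randn(1000, 5), np.random.randn(1000)           # placeholder data
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)
# Fit candidate models on (X_train, y_train), choose the model space using
# (X_val, y_val), and report the error of the final model on (X_test, y_test).
```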
Pros:
Measures what we want: Performance of the actual learned model.
Simple
Cons:
Smaller effective training sets make performance and performance estimates more variable.
Small validation sets can give poor model selection
Small test sets can give poor estimates of performance
For a test set of size 100, with 60 correct classifications, 95% C.I. for actual accuracy is \((0.497,0.698)\).
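A sketch of how such an interval can be computed, here as an exact (Clopper-Pearson) binomial interval; other interval constructions give slightly different endpoints, which may explain small differences from the numbers quoted above.

```python
# 95% Clopper-Pearson interval for accuracy with 60 correct out of 100.
from scipy.stats import beta

n, correct = 100, 60
lower = beta.ppf(0.025, correct, n - correct + 1)
upper = beta.ppf(0.975, correct + 1, n - correct)
print(f"95% CI for accuracy: ({lower:.3f}, {upper:.3f})")   # approx. (0.497, 0.697)
```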
Divide the instances into \(k\) disjoint partitions or folds of size \(n/k\)
Loop through the partitions \(i = 1 ... k\):
Partition \(i\) is for evaluation (i.e., estimating the performance of the algorithm after learning is done)
The rest are used for training (i.e., choosing the specific model within the space)
“Cross-Validation Error” is the average error on the evaluation partitions. Has lower variance than error on one partition.
This is the core CV idea; the procedures below apply it for different purposes.
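A minimal sketch of basic k-fold cross-validation; the model (linear regression) and data are placeholders.

```python
# Average the held-out error over the k evaluation folds.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

X, y = np.random.randn(200, 3), np.random.randn(200)             # placeholder data
fold_errors = []
for train_idx, eval_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])   # train on k-1 folds
    fold_errors.append(mean_squared_error(y[eval_idx], model.predict(X[eval_idx])))
print("cross-validation error:", np.mean(fold_errors))
```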
Divide the instances into \(k\) folds of size \(n/k\).
Loop over \(m\) model spaces \(1 ... m\)
Loop over the \(k\) folds \(i = 1 ... k\):
Fold \(i\) is for validation (i.e., estimating the performance of each model space in order to choose among them)
The rest are used for training (i.e., choosing the specific model within the space)
For each model space, report average error over folds, and standard error.
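A sketch of using k-fold CV to compare model spaces, here polynomial degrees; the data, degrees, and fold count are placeholders.

```python
# For each model space, report mean CV error and its standard error over folds.
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
y = np.sin(3 * x) + rng.normal(scale=0.3, size=x.shape)

k = 5
for degree in [1, 2, 3, 5, 9]:                                    # the m model spaces
    errors = []
    for train_idx, val_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(x.reshape(-1, 1)):
        coefs = np.polyfit(x[train_idx], y[train_idx], degree)    # train on k-1 folds
        pred = np.polyval(coefs, x[val_idx])                      # validate on fold i
        errors.append(np.mean((pred - y[val_idx]) ** 2))
    errors = np.array(errors)
    print(f"degree {degree}: CV error {errors.mean():.3f} "
          f"(SE {errors.std(ddof=1) / np.sqrt(k):.3f})")
```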
Divide the instances into \(k\) “outer” folds of size \(n/k\).
Loop over the outer folds \(i = 1 ... k\):
Hold out outer fold \(i\); split the remaining instances into “inner” folds.
Loop over the \(m\) model spaces \(1 ... m\), running cross-validation on the inner folds.
Use average error over the inner folds and SE to choose a model space.
Train the chosen model space on all inner folds.
Test the resulting model on outer fold \(i\).
The average error over the outer test folds estimates the performance of the whole selection-and-training procedure.
Minimum-CV Estimate: 128.48, Nested CV Estimate: 149.91
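Reporting the minimum CV error over model spaces is optimistically biased; nested CV estimates the error of the whole selection procedure. A sketch using scikit-learn, with placeholder model, grid, and data (so it will not reproduce the numbers above): the inner loop (GridSearchCV) chooses the model space, the outer loop estimates performance.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

X, y = np.random.randn(200, 1), np.random.randn(200)             # placeholder data
inner = GridSearchCV(
    make_pipeline(PolynomialFeatures(), LinearRegression()),
    param_grid={"polynomialfeatures__degree": [1, 2, 3, 5, 9]},  # the model spaces
    scoring="neg_mean_squared_error",
    cv=KFold(n_splits=5, shuffle=True, random_state=0),          # inner folds
)
outer_scores = cross_val_score(
    inner, X, y, scoring="neg_mean_squared_error",
    cv=KFold(n_splits=5, shuffle=True, random_state=1),          # outer folds
)
print("nested-CV error estimate:", -outer_scores.mean())
```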
The training error decreases with the complexity (size) of the model space
Generalization error decreases at first, then starts increasing
Setting aside a validation set helps us find a good model space.
We can then report an unbiased error estimate using a test set that was untouched during both parameter training and validation.
Cross-validation is a lower-variance but possibly biased version of this approach. It is standard.
If you have lots of data, just use held-out validation and test sets.
Training error is biased downward.
For simple models, including linear ones, we can get a less-biased estimate of generalization error by adjusting the training error upwards. These adjusted training error estimators include: \(C_p\), AIC, and BIC
If you have a limited amount of data, and you are using linear models, and you want to do model selection, you may want to use one of these instead of a validation set.
See JWHT 6.1.3 for examples.
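As a hedged sketch of what such an adjustment looks like: the function below computes the usual Gaussian linear-model forms of AIC, BIC, and Mallows' \(C_p\) (up to additive constants) for a least-squares fit with \(d\) parameters; JWHT 6.1.3 writes these in a slightly different but equivalent scaling.

```python
import numpy as np

def adjusted_criteria(y, y_hat, d, sigma2_full=None):
    """Adjusted training-error criteria for a least-squares fit (smaller is better)."""
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    aic = n * np.log(rss / n) + 2 * d             # penalizes model size
    bic = n * np.log(rss / n) + d * np.log(n)     # heavier penalty for large n
    # Mallows' C_p needs a noise-variance estimate, e.g. from the full model.
    cp = rss / sigma2_full - n + 2 * d if sigma2_full is not None else None
    return aic, bic, cp
```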
Instead of \(h(\mathbf{x})\), we write \(h(\boldsymbol{\phi}(\mathbf{x}))\), or \(h(\phi_1(\mathbf{x}), \phi_2(\mathbf{x}), \ldots, \phi_k(\mathbf{x}))\).
where the \(\phi_j\) are sometimes called feature functions or basis functions that define new features in terms of what we might call the “raw data”
Basis functions are fixed for training (but can be chosen through model selection)
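A minimal sketch of fixed basis functions: the model is still linear in its parameters, but it operates on \(\boldsymbol{\phi}(\mathbf{x})\) rather than the raw input. The particular basis here (polynomial terms plus a sine) is just an illustration.

```python
import numpy as np

def phi(x):
    # map a raw scalar x to the feature vector (phi_1(x), ..., phi_k(x))
    return np.array([1.0, x, x**2, np.sin(x)])

x_raw = np.linspace(-1, 1, 50)
Phi = np.stack([phi(x) for x in x_raw])            # n-by-k design matrix
y = 2 * x_raw - 0.5 * x_raw**2 + np.random.normal(scale=0.1, size=x_raw.shape)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)        # fit h(phi(x)) = w . phi(x)
```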
Goal: omit features that are not helpful for prediction.
Forward selection: Start with no features, try adding each one, measure performance. (How?) Keep the best such model, repeat until all features are included. (A sketch appears at the end of this section.)
Backward selection: Start with all features, try removing each one (separately), keep the best model that has had one feature removed. Repeat until no features are included.
For both of these, \(p+1\) models are created if there are \(p\) features. Choose your preferred model. (How?)
The criterion for “performance” is important; for predictive modelling, it should be a predictive criterion such as average loss.
Forward and backward selection can be used with any supervised learning method. (Sometimes called wrapper methods.)
There are other feature selection techniques that are specific to individual learning methods.
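A sketch of forward selection as a wrapper method, assuming scikit-learn; the model (linear regression), scoring, and data are placeholders. At each step it adds the single feature that most improves a cross-validated score, producing a sequence of nested candidate models from which you pick one by its predictive score.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

def forward_selection(X, y, cv=5):
    remaining = list(range(X.shape[1]))
    selected, path = [], []
    while remaining:
        scores = []
        for j in remaining:
            cols = selected + [j]
            score = cross_val_score(LinearRegression(), X[:, cols], y, cv=cv).mean()
            scores.append((score, j))
        best_score, best_j = max(scores)           # feature whose addition scores best
        selected.append(best_j)
        remaining.remove(best_j)
        path.append((list(selected), best_score))
    return path                                    # nested candidate models and scores

X, y = np.random.randn(200, 6), np.random.randn(200)   # placeholder data
for features, score in forward_selection(X, y):
    print(features, round(score, 3))
```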