*Performance:* We would like to estimate the **generalization error** of our resulting predictor.

*Model selection:* We would like to choose the best model space (e.g. linear, quadratic, …) **for the data we have**.

2017-09-25

Model Selection


Choose model class

Find the model in the class that gives the minimum training error.

But we saw previously that **generalization error** is what we really want to minimize.

And picking the wrong model class can be catastrophic.

The best model space is typically neither the simplest nor the most complex.

Larger model spaces *always* lead to training error that is at least as low.

Suppose \(\mathcal{H}_1\) is the space of all linear functions, \(\mathcal{H}_2\) is the space of all quadratic functions. Note \(\mathcal{H}_1 \subset \mathcal{H}_2\).

Fix a data set.

Let \(h^*_1 = \arg\min_{h' \in \mathcal{H}_1} \frac{1}{n}\sum_{i=1}^n L(h'(\mathbf{x}_i),y_i)\) and \(h^*_2 = \arg\min_{h' \in \mathcal{H}_2} \frac{1}{n}\sum_{i=1}^n L(h'(\mathbf{x}_i),y_i)\), both computed using the same dataset.

It **must** be the case that \(\min_{h' \in \mathcal{H}_2} \frac{1}{n}\sum_{i=1}^n L(h'(\mathbf{x}_i),y_i) \le \min_{h' \in \mathcal{H}_1} \frac{1}{n}\sum_{i=1}^n L(h'(\mathbf{x}_i),y_i)\), because every function in \(\mathcal{H}_1\) is also available in \(\mathcal{H}_2\).

Small training error with large generalization error is known as **overfitting**.
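The inequality can be checked numerically. The sketch below is illustrative only: it assumes squared-error loss for \(L\), one-dimensional inputs, and synthetic data, and the helper names (`solve`, `fit_poly`, `mse`) are hypothetical rather than taken from the course material.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][j] * x[j] for j in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x

def fit_poly(xs, ys, degree):
    """Least-squares polynomial coefficients (lowest power first),
    obtained from the normal equations."""
    d = degree + 1
    A = [[sum(x ** (i + j) for x in xs) for j in range(d)] for i in range(d)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(d)]
    return solve(A, b)

def mse(coef, xs, ys):
    """Average squared error of the polynomial on the given data."""
    def predict(x):
        return sum(c * x ** p for p, c in enumerate(coef))
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Synthetic data (assumption): the true relationship is linear plus noise.
random.seed(0)
xs = [random.uniform(-1, 1) for _ in range(20)]
ys = [1.0 + 2.0 * x + random.gauss(0, 0.3) for x in xs]

err_linear = mse(fit_poly(xs, ys, 1), xs, ys)     # best h in H_1
err_quadratic = mse(fit_poly(xs, ys, 2), xs, ys)  # best h in H_2
print(err_linear, err_quadratic)
```

Even though the data here are generated from a linear function, the quadratic fit reaches a training error that is no higher, which is exactly why training error alone cannot be used to choose between model spaces.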

- A separate **validation set** can be used for model selection.
  - Train on the training set using each proposed model space
  - Evaluate each on the validation set, identify the one with lowest *validation* error
  - Choose the simplest model with performance < 1 std. error worse than the best.
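The last step above, often called the one-standard-error rule, can be sketched as follows. This is a minimal illustration under assumptions: each model space is summarized by a hypothetical `(complexity, error, std_error)` tuple, the standard error used for the threshold is the one attached to the best model, and the numbers are made up.

```python
def one_standard_error_choice(results):
    """results: list of (complexity, error, std_error), one per model space.
    Returns the complexity of the simplest model whose error is less than
    one standard error worse than the best model's error."""
    best = min(results, key=lambda r: r[1])
    threshold = best[1] + best[2]          # best error + its standard error
    eligible = [r for r in results if r[1] < threshold]
    return min(eligible, key=lambda r: r[0])[0]

# Hypothetical validation results for polynomial degrees 1 to 4.
results = [(1, 0.30, 0.03), (2, 0.21, 0.02), (3, 0.20, 0.02), (4, 0.25, 0.02)]
chosen = one_standard_error_choice(results)
print(chosen)  # 2: degree 3 is best, but degree 2 is within one std. error
```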

Experimental Scenario

- Generate training data, validation data
- Choose best model using validation data as per above
- Estimate performance of best model using validation data

Will this produce an unbiased estimate of generalization error?

A general procedure for doing model selection and performance evaluation

The data is randomly partitioned into three disjoint subsets:

- A *training set* used only to find the parameters \(\mathbf{w}\)
- A *validation set* used to find the right model space (e.g., the degree of the polynomial)
- A *test set* used to estimate the generalization error of the resulting model

Can generate standard confidence intervals for the generalization error of the learned model
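The random three-way partition described above can be sketched in a few lines. The 60/20/20 proportions and the function name `three_way_split` are assumptions for illustration; the source does not prescribe split sizes.

```python
import random

def three_way_split(n, frac_train=0.6, frac_val=0.2, seed=0):
    """Randomly partition the indices 0..n-1 into disjoint
    training, validation, and test index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(frac_train * n)
    n_val = int(frac_val * n)
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]           # whatever remains
    return train, val, test

train, val, test = three_way_split(100)
print(len(train), len(val), len(test))
```

The test indices should be touched only once, after the model space has been chosen on the validation set and the final model has been trained.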

Pros:

Measures what we want: Performance of the actual learned model.

Simple

Cons:

Smaller effective training sets make performance and performance estimates more variable.

Small validation sets can give poor model selection

Small test sets can give poor estimates of performance

For a test set of size 100, with 60 correct classifications, 95% C.I. for actual accuracy is \((0.497,0.698)\).
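A binomial confidence interval of this kind can be computed with the standard library alone. The source does not say which interval method it used, so the exact (Clopper-Pearson) interval below is an assumption, and the helper names are hypothetical.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(k + 1))

def find_root(f, lo=0.0, hi=1.0, iters=60):
    """Bisection for a monotone f on [lo, hi] that changes sign once."""
    f_lo_positive = f(lo) > 0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if (f(mid) > 0) == f_lo_positive:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for a binomial proportion,
    given k successes out of n trials."""
    # Lower bound: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else find_root(
        lambda p: (1 - binom_cdf(k - 1, n, p)) - alpha / 2)
    # Upper bound: the p at which P(X <= k) equals alpha / 2.
    upper = 1.0 if k == n else find_root(
        lambda p: binom_cdf(k, n, p) - alpha / 2)
    return lower, upper

low, high = clopper_pearson(60, 100)
print(f"({low:.3f}, {high:.3f})")
```

The width of the interval, roughly 0.2 for a test set of 100 examples, is what makes small test sets a weak basis for performance estimates.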

(HTF 7.10, JWHT 5.1)

Divide the instances into \(k\) disjoint partitions or folds of size \(n/k\)

Loop through the partitions \(i = 1, \ldots, k\):

- Partition \(i\) is for evaluation (i.e., estimating the performance of the algorithm after learning is done)
- The rest are used for training (i.e., choosing the specific model within the space)

*"Cross-Validation Error" is the average error on the evaluation partitions.* It has lower variance than the error on one partition.

This is the main CV **idea**; CV is used for different purposes though.
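The loop above can be sketched for a single, fixed model space. The example assumes one-dimensional linear regression with squared-error loss on synthetic data; `k_fold_indices`, `cross_validation_error`, and `fit_line` are hypothetical names chosen for this illustration.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validation_error(xs, ys, k, fit, loss):
    """Average, over the k folds, of the error on the held-out fold."""
    folds = k_fold_indices(len(xs), k)
    fold_errors = []
    for i, eval_idx in enumerate(folds):
        # Train on every fold except fold i.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        model = fit([xs[j] for j in train_idx], [ys[j] for j in train_idx])
        # Evaluate on fold i only.
        err = sum(loss(model(xs[j]), ys[j]) for j in eval_idx) / len(eval_idx)
        fold_errors.append(err)
    return sum(fold_errors) / k, fold_errors

def fit_line(xs, ys):
    """Closed-form least-squares line; returns a prediction function."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def squared_loss(prediction, target):
    return (prediction - target) ** 2

# Synthetic data (assumption): linear signal plus Gaussian noise.
random.seed(1)
xs = [random.uniform(-1, 1) for _ in range(50)]
ys = [1.0 + 2.0 * x + random.gauss(0, 0.3) for x in xs]

cv_err, fold_errs = cross_validation_error(xs, ys, 5, fit_line, squared_loss)
print(cv_err, fold_errs)
```

Every example is used for evaluation exactly once and for training \(k-1\) times, which is where the variance reduction relative to a single held-out partition comes from.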

(HTF 7.10, JWHT 5.1)

Divide the instances into \(k\) folds of size \(n/k\).

Loop over \(m\) model spaces \(1, \ldots, m\):

- Loop over the \(k\) folds \(i = 1, \ldots, k\):
  - Fold \(i\) is for validation (i.e., estimating the performance of the algorithm after learning is done)
  - The rest are used for training (i.e., choosing the specific model within the space)

For each model space, report average error over folds, and standard error.
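The nested loop over model spaces and folds can be sketched as below. The model spaces are assumed to be polynomials of degree 0 to 3 in one variable, the loss is squared error, the data are synthetic, and the standard error is taken as the sample standard deviation of the fold errors divided by \(\sqrt{k}\); all function names are hypothetical.

```python
import random
from statistics import mean, stdev

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][j] * x[j] for j in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x

def fit_poly(xs, ys, degree):
    """Least-squares polynomial coefficients, lowest power first."""
    d = degree + 1
    A = [[sum(x ** (i + j) for x in xs) for j in range(d)] for i in range(d)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(d)]
    return solve(A, b)

def predict(coef, x):
    return sum(c * x ** p for p, c in enumerate(coef))

def cv_by_degree(xs, ys, degrees, k, seed=0):
    """Return {degree: (mean CV error, standard error over folds)}."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]   # same folds for every model space
    report = {}
    for d in degrees:                        # loop over model spaces
        fold_errors = []
        for i in range(k):                   # loop over folds
            train = [j for f in range(k) if f != i for j in folds[f]]
            coef = fit_poly([xs[j] for j in train], [ys[j] for j in train], d)
            errs = [(predict(coef, xs[j]) - ys[j]) ** 2 for j in folds[i]]
            fold_errors.append(mean(errs))
        report[d] = (mean(fold_errors), stdev(fold_errors) / k ** 0.5)
    return report

# Synthetic data (assumption): quadratic signal plus Gaussian noise.
random.seed(2)
xs = [random.uniform(-1, 1) for _ in range(60)]
ys = [1.0 - x + 2.0 * x * x + random.gauss(0, 0.2) for x in xs]

report = cv_by_degree(xs, ys, degrees=[0, 1, 2, 3], k=5)
for d, (err, se) in report.items():
    print(f"degree {d}: CV error {err:.3f} +/- {se:.3f}")
```

Using the same folds for every model space keeps the comparison between spaces paired, so differences in the reported errors reflect the model space rather than the luck of the split.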