Which do you prefer and why?
2017-09-25
Define the loss (error) of the hypothesis on an example \(({\mathbf{x}}, y)\) as \[L(h({\mathbf{x}}) ,y)\]
Suppose \(({\bf X},Y)\) is a vector-valued random variable. Then what is \[L(h({{\bf X}}),Y)\]
Given a model \(h\), (which could have come from anywhere), its generalization error is: \[E[L(h({{\bf X}}),Y)]\]
Given a set of data points \(({\mathbf{x}}_i, y_i)\) that are realizations of \(({{\bf X}},Y)\), we can compute the empirical error \[\bar\ell_{h,n} = \frac{1}{n}\sum_{i=1}^n L(h({{{\mathbf{x}}}}_i),y_i)\]
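As a concrete sketch, here is the empirical error computed in R. The hypothesis `h` and the squared-error loss are illustrative choices, not fixed by the text:

```r
# Empirical error \bar\ell_{h,n} of a hypothesis h on a toy dataset.
# h and the squared loss L are example choices for illustration.
L <- function(yhat, y) (yhat - y)^2   # squared-error loss
h <- function(x) 2 * x + 1            # an arbitrary example hypothesis

x <- c(0.5, 1.0, 1.5)                 # toy realizations of (X, Y)
y <- c(2.1, 2.9, 4.2)

ell_bar <- mean(L(h(x), y))           # (1/n) * sum of L(h(x_i), y_i)
```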
What is \(\bar\ell_{h,n}\)?
\[ \bar{x}_n = \frac{1}{n} \sum_{i=1}^n x_i \]
Given a dataset, \(\bar x_n\) is a fixed number. We use \(\bar X_n\) to denote the random variable corresponding to the sample mean computed from a randomly drawn dataset of size \(n\).
Datasets of size \(n = 15\), sample means plotted in red.
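A picture like this can be produced by simulation. The Normal(3.5, 1) population below is an assumed stand-in; the point is that each dataset of size 15 yields its own sample mean:

```r
# Sampling distribution of the sample mean, by simulation.
# Population Normal(3.5, 1) is an assumed example.
set.seed(1)
n <- 15
xbars <- replicate(2000, mean(rnorm(n, mean = 3.5, sd = 1)))

# Each entry of xbars is one realization of the random variable \bar X_n.
mean(xbars)   # close to the population mean 3.5
sd(xbars)     # close to 1 / sqrt(15)
```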
A statistic is any summary of a dataset. (E.g. \(\bar x_n\), sample median.) A statistic is the result of a function applied to a dataset.
A parameter is any summary of the distribution of a random variable. (E.g. \(\mu_X\), median.) A parameter is the result of a function applied to a distribution.
Given an estimate, how good is it?
The distribution of an estimator is called its sampling distribution.
Bias: \[ E[\bar{X}_n - \mu_X] \]
Variance: \[ E[(\bar{X}_n - E[\bar{X}_n])^2] \]
Mean squared error (MSE): \[ E[(\bar{X}_n - \mu_X)^2] \]
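These three quantities can be estimated by simulation. The Normal(2, 3) population is an assumed example:

```r
# Estimate the bias, variance, and MSE of \bar X_n by simulation.
# Population Normal(mu = 2, sd = 3) is an assumed example.
set.seed(2)
mu <- 2; sigma <- 3; n <- 25
xbars <- replicate(5000, mean(rnorm(n, mu, sigma)))

bias_hat <- mean(xbars - mu)               # ~ 0: the sample mean is unbiased
var_hat  <- mean((xbars - mean(xbars))^2)  # ~ sigma^2 / n = 0.36
mse_hat  <- mean((xbars - mu)^2)           # variance + bias^2
```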
\[ f_{X}(x) = \frac{1}{\sigma_X\sqrt{2\pi}} \mathrm{e}^{-\frac{(x - \mu_X)^2}{2\sigma_X^2}} \]
The normal distribution is defined by two parameters: \(\mu_X, \sigma^2_X\).
The normal distribution is special (among other reasons) because many estimators have approximately normal sampling distributions or have sampling distributions that are closely related to the normal.
For an estimator like \(\bar{X}_n\), if we know \(\mu_{\bar{X}_n}\) and \(\sigma^2_{\bar{X}_n}\), then we can say a lot about how good it is.
\[ F_{\bar X_n}(\bar x) \approx \int_{-\infty}^{\bar x} \frac{1}{\sigma_n\sqrt{2\pi}} \mathrm{e}^{-\frac{(t - \mu_X)^2}{2\sigma_n^2}} \, dt \]
where
\[ \sigma_n = \frac{\sigma}{\sqrt{n}} \]
is called the standard error (so \(\sigma_n^2 = \sigma^2/n\)) and \(\sigma^2\) is the variance of \(X\).
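A quick simulation check of \(\sigma_n^2 = \sigma^2/n\). An exponential population (with \(\sigma^2 = 1\)) is used here as an assumed example, to emphasize that the result does not require a normal population:

```r
# Check that Var(\bar X_n) = sigma^2 / n by simulation.
# Exponential(rate = 1) population (sigma^2 = 1) is an assumed example.
set.seed(3)
n <- 50
xbars <- replicate(10000, mean(rexp(n, rate = 1)))
var(xbars)   # close to 1 / 50 = 0.02
```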
Eruptions dataset has \(n = 272\) observations.
Our estimate of the mean of eruption times is \(\bar x_{272}\) = 3.4877831.
What is the probability of observing an \({\bar{x}}_{272}\) that is within 10 seconds (about 0.17 minutes; eruption times are in minutes) of the true mean?
By the C.L.T., \[\Pr(-0.17 \le {\bar{X}}_{272} - \mu_X \le 0.17) = \int_{-0.17}^{0.17} \frac{1}{\sigma_n\sqrt{2\pi}} \mathrm{e}^{-\frac{x^2}{2\sigma^2_n}} \, dx\]
\[= 0.986\]
Note! I estimated \(\sigma_X\) here. (Look up "\(t\)-test" for details.)
\[\int_{-0.17}^{0.17} \frac{1}{\sigma_n\sqrt{2\pi}} \mathrm{e}^{-\frac{x^2}{2\sigma^2_n}} \, dx = 0.986\]
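This probability can be checked numerically with `pnorm`, estimating \(\sigma_X\) by the sample standard deviation as noted above:

```r
# Normal-approximation check of the probability above.
# sigma_X is estimated from the data (see the t-test note).
x <- faithful$eruptions
se <- sd(x) / sqrt(length(x))   # estimated sigma_n
p <- pnorm(0.17, mean = 0, sd = se) - pnorm(-0.17, mean = 0, sd = se)
round(p, 3)                     # 0.986
```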
Training error underestimates generalization error. It is a biased estimator.
If you really want a good estimate of generalization error, you need to hold out a separate test set of data not used for training.
Possibly of size \(n = (1.96)^2\frac{\sigma_L^2}{d^2}\), where \(\sigma_L^2\) is the variance of the loss (which has to be guessed or estimated from training) and \(d\) is the half-width of a 95% confidence interval.
Could report the test error, but then deploy whatever you train on the whole data. (Probably won’t be worse.)
## [1] "Estimated variance of errors: 0.168326238718343"
## [1] "Sample required for CI width of 0.2 (+- 0.1): 65"
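The sample-size figure above can be reproduced directly from the formula, plugging in the estimated variance of the loss:

```r
# Sample size for a 95% CI of half-width d: n = 1.96^2 * sigma_L^2 / d^2.
sigma2_L <- 0.168326238718343   # estimated variance of the loss (from above)
d <- 0.1                        # desired CI half-width
n <- ceiling(1.96^2 * sigma2_L / d^2)
n                               # 65, matching the output above
```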
TestMSE | VarOfErrors | StdOfSquaredErrors | n | StandardError | CI_left | CI_right |
---|---|---|---|---|---|---|
0.2261605 | 0.0980595 | 0.3131445 | 65 | 0.0388408 | 0.1500326 | 0.3022885 |
\[ \mathrm{MSE} = n^{-1} \sum_{i=1}^n (\hat y_i - y_i)^2\]
\[ \mathrm{RMSE} = \sqrt{ n^{-1} \sum_{i=1}^n (\hat y_i - y_i)^2 }\]
\[ \mathrm{MAE} = n^{-1} \sum_{i=1}^n |\hat y_i - y_i|\]
I find MAE easier to interpret. (How far am I from the correct value, on average?) RMSE is at least in the same units as the \(y\).
\[ \mathrm{MRE} = n^{-1} \sum_{i=1}^n \frac{|\hat y_i - y_i|}{|y_i|}\]
Scales error by the magnitude of the true \(y\). E.g., if MRE \(= 0.2\), then the regression is off by 20% of the value of \(y\), on average.
If this is appropriate for your problem then linear regression, which assumes additive error, may not be appropriate. Options include using a different model or regression on \(\log y\) rather than on \(y\).
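For reference, the four metrics above written as R functions; the `y`/`yhat` values are toy numbers for illustration only:

```r
# Error metrics for regression; yhat are predictions, y the true values.
mse  <- function(yhat, y) mean((yhat - y)^2)
rmse <- function(yhat, y) sqrt(mse(yhat, y))
mae  <- function(yhat, y) mean(abs(yhat - y))
mre  <- function(yhat, y) mean(abs(yhat - y) / abs(y))

y    <- c(10, 20, 40)   # toy true values
yhat <- c(12, 18, 44)   # toy predictions
mre(yhat, y)            # (0.2 + 0.1 + 0.1) / 3
```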
https://en.wikipedia.org/wiki/Approximation_error#Formal_Definition
```r
library(boot)
library(ggplot2)

# Bootstrap the sample mean of eruption times.
bootstraps <- boot(faithful$eruptions, function(d, i) mean(d[i]), R = 5000)
bootdata <- data.frame(xbars = bootstraps$t)

# 95% bootstrap percentile interval.
limits <- quantile(bootdata$xbars, c(0.025, 0.975))

ggplot(bootdata, aes(x = xbars)) +
  labs(y = "Prop.") +
  geom_histogram(aes(y = ..density..)) +
  geom_errorbarh(aes(xmin = limits[[1]], xmax = limits[[2]], y = 0),
                 height = 0.25, colour = "red", size = 2)
```