2016-04-06
To date, we have focussed on accuracy: How often is my classifier correct on new data?
Depending on how the classifier will be applied, however, other measures may be more appropriate.
Compute the proportion that were correctly or incorrectly classified.
\[\mathrm{Accuracy} = n^{-1} \sum_{i=1}^n 1(\hat{y}_i = y_i)\]
\[\mathrm{Error Rate} = n^{-1} \sum_{i=1}^n 1(\hat{y}_i \ne y_i)\]
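As a minimal sketch of these two formulas in R: the label vectors below are made up for illustration and coded -1/+1 as elsewhere in these slides.

```r
# Toy true labels and predictions (hypothetical values, not from the experiments here)
y    <- factor(c( 1, -1, -1,  1, -1, -1, -1,  1, -1, -1), levels = c(-1, 1))
yhat <- factor(c( 1, -1,  1, -1, -1, -1, -1,  1, -1, -1), levels = c(-1, 1))

accuracy   <- mean(yhat == y)   # proportion classified correctly
error.rate <- mean(yhat != y)   # proportion classified incorrectly

print(c(accuracy = accuracy, error.rate = error.rate))
```

The two quantities always sum to 1, so reporting either one is enough.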
Literature on learning from unbalanced classes:
http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=5299216
http://link.springer.com/article/10.1007%2Fs10115-013-0670-6
http://www.computer.org/csdl/proceedings/icdm/2012/4905/00/4905a695-abs.html
http://www.computer.org/csdl/proceedings/icdm/2011/4408/00/4408a754-abs.html
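library(e1071)    # svm(), tune()
library(ggplot2)
library(dplyr)    # filter(), sample_n(), %>% used later
# mupos and muneg (the class means) are assumed to be defined earlier; their values are not shown here
# Balanced classes: 500 positives, 500 negatives
npos <- 500; nneg <- 500
set.seed(1)
df <- rbind(data.frame(x=rnorm(npos,mupos), y=1),
            data.frame(x=rnorm(nneg,muneg), y=-1))
df$y <- as.factor(df$y)
sep <- tune(svm,y~x,data=df,ranges=list(gamma = 2^(-1:1), cost = 2^(2:4)))
df$ypred <- predict(sep$best.model)
ggplot(df,aes(x=x,fill=y)) +
  geom_histogram(alpha=0.2,position="identity",bins=51) +
  geom_point(aes(y=ypred,colour=ypred)) +
  scale_color_discrete(drop=FALSE)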
# Unbalanced classes: 50 positives, 950 negatives
npos <- 50; nneg <- 950
set.seed(1)
df <- rbind(data.frame(x=rnorm(npos,mupos), y=1),
            data.frame(x=rnorm(nneg,muneg), y=-1))
df$y <- as.factor(df$y)
sep <- tune(svm,y~x,data=df,ranges=list(gamma = 2^(-1:1), cost = 2^(2:4)))
df$ypred <- predict(sep$best.model)
ggplot(df,aes(x=x,fill=y)) +
  geom_histogram(alpha=0.2,position="identity",bins=51) +
  geom_point(aes(y=ypred,colour=ypred)) +
  scale_color_discrete(drop=FALSE)
# Upsample the rare positive class: draw 900 extra positives with replacement
newpos <- df %>% filter(y == 1) %>% sample_n(900,replace=TRUE)
dfupsamp <- rbind(df,newpos)
sep <- tune(svm,y~x,data=dfupsamp,ranges=list(gamma = 2^(-1:1), cost = 2^(2:4)))
dfupsamp$ypred <- predict(sep$best.model)
ggplot(dfupsamp,aes(x=x,fill=y)) +
  geom_histogram(alpha=0.2,position="identity",bins=51) +
  geom_point(aes(y=ypred,colour=ypred)) +
  scale_color_discrete(drop=FALSE)
# Predict on the original (not upsampled) data with the classifier trained on upsampled data
df$upsampred <- predict(sep$best.model,df)
mean(df$y == df$ypred)
## [1] 0.95
mean(df$y == df$upsampred)
## [1] 0.684
No upsampling: 95% Accuracy
Upsampled: 68% Accuracy
So why do you like the upsampled classifier better?
Total population | True class positive | True class negative
Predicted class positive | True positive | False positive (Type I error)
Predicted class negative | False negative (Type II error) | True negative
Careful! A "false positive" is actually a negative and a "false negative" is actually a positive.
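In R, `table()` gives this layout directly. The sketch below uses counts back-calculated from the accuracy, F-measure and balanced accuracy outputs printed in these slides for the upsampled classifier (TP = 40, FP = 306, FN = 10, TN = 644); treat them as an illustration, not as a separately reported result.

```r
# Assumed counts: 40 TP, 10 FN, 306 FP, 644 TN (back-calculated, see lead-in)
n     <- c(40, 10, 306, 644)
truth <- factor(rep(c(1,  1, -1, -1), times = n), levels = c(1, -1))
pred  <- factor(rep(c(1, -1,  1, -1), times = n), levels = c(1, -1))

# Rows are predictions, columns are true classes, as in the table above
cm <- table(Predicted = pred, True = truth)
print(cm)

TP <- cm["1", "1"];  FP <- cm["1", "-1"]
FN <- cm["-1", "1"]; TN <- cm["-1", "-1"]
```

Putting the positive level first in `levels` keeps the true positives in the top-left cell, matching the table.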
\[\mathrm{Precision} = \frac{\sum \mbox{True positive}}{\sum \mbox{Predicted positive}}\]
\[\mathrm{Recall} = \frac{\sum \mbox{True positive}}{\sum \mbox{Class positive}}\]
\[\mbox{F-measure} = 2 \frac{\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}\]
https://en.wikipedia.org/wiki/F1_score
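For the "always predict -1" classifier there are no true positives, so recall = 0. Precision is 0/0 (nothing is predicted positive) and is taken to be 0 by convention, so
F-measure = 0.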
For the classifier learned from up-sampled data,
prec <- sum(df$y == 1 & df$upsampred == 1) / sum(df$upsampred == 1)
recall <- sum(df$y == 1 & df$upsampred == 1) / sum(df$y == 1)
F1.upsamp <- 2 * prec*recall / (prec + recall)
print(F1.upsamp)
## [1] 0.2020202
NOTE that F-measure is not "symmetric"; it depends on the definition of the positive class. Typically used when positive class is rare but important to an application.
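To see the asymmetry, one can compute the F-measure twice from the same confusion matrix, once with +1 as the positive class and once with -1. The counts are assumptions, back-calculated from the outputs printed in these slides for the upsampled classifier, and `f1()` is a hypothetical helper written for this sketch.

```r
# F-measure from counts; algebraically equal to 2*prec*recall/(prec+recall),
# but still defined when nothing is predicted positive
f1 <- function(tp, fp, fn) 2 * tp / (2 * tp + fp + fn)

# Assumed counts with +1 as the positive class
TP <- 40; FP <- 306; FN <- 10; TN <- 644

f1.pos <- f1(TP, FP, FN)   # +1 is "positive"
# With -1 as "positive": TN plays the role of TP, FN of FP, and FP of FN
f1.neg <- f1(TN, FN, FP)

print(c(positive.is.plus1 = f1.pos, positive.is.minus1 = f1.neg))
```

The same predictions score about 0.20 with the rare class as positive and about 0.80 with the common class as positive, so the choice of positive class has to be stated alongside the number.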
\[\mathrm{Sensitivity} = \frac{\sum \mbox{True positive}}{\sum \mbox{Class positive}}\]
\[\mathrm{Specificity} = \frac{\sum \mbox{True negative}}{\sum \mbox{Class negative}}\]
\[\mbox{BalancedAccuracy} = \frac{1}{2} (\mathrm{Sensitivity}+\mathrm{Specificity})\]
https://en.wikipedia.org/wiki/Accuracy_and_precision#In_binary_classification
Note: Sensitivity is same as Recall.
For "always predict -1" classifier, sensitivity = 0, specificity = 1, balanced accuracy = 0.5.
For the classifier learned from up-sampled data,
sens <- sum(df$y == 1 & df$upsampred == 1) / sum(df$y == 1)
spec <- sum(df$y == -1 & df$upsampred == -1) / sum(df$y == -1)
bal.acc.upsamp <- 0.5*(sens + spec)
print(bal.acc.upsamp)
## [1] 0.7389474
https://en.wikipedia.org/wiki/Evaluation_of_binary_classifiers
Total population | True class positive | True class negative
Predicted class positive | True positive | False positive (Type I error)
Predicted class negative | False negative (Type II error) | True negative

Measure | Definition
Prevalence | Σ Class positive / Σ Total population
Accuracy (ACC) | (Σ True positive + Σ True negative) / Σ Total population
Positive predictive value (PPV), Precision | Σ True positive / Σ Predicted positive
False discovery rate (FDR) | Σ False positive / Σ Predicted positive
False omission rate (FOR) | Σ False negative / Σ Predicted negative
Negative predictive value (NPV) | Σ True negative / Σ Predicted negative
True positive rate (TPR), Sensitivity, Recall | Σ True positive / Σ Class positive
False positive rate (FPR), Fall-out | Σ False positive / Σ Class negative
False negative rate (FNR), Miss rate | Σ False negative / Σ Class positive
True negative rate (TNR), Specificity (SPC) | Σ True negative / Σ Class negative
Positive likelihood ratio (LR+) | TPR / FPR
Negative likelihood ratio (LR−) | FNR / TNR
Diagnostic odds ratio (DOR) | LR+ / LR−
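\[\mathrm{Sensitivity} = \frac{\sum \mbox{True positive}}{\sum \mbox{Class positive}}\]
\[\mathrm{Specificity} = \frac{\sum \mbox{True negative}}{\sum \mbox{Class negative}}\]
Recall that balanced accuracy weights the two equally: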
\[\mbox{BalancedAccuracy} = \frac{1}{2} (\mathrm{Sensitivity}+\mathrm{Specificity})\]
What if e.g. false positives are more costly than false negatives?
Let \(\mathrm{P}\) and \(\mathrm{N}\) be the proportions of positives and negatives in the population.
\[\mathrm{FNRate} = (1 - \mathrm{Sensitivity}), \mathrm{FPRate} = (1 - \mathrm{Specificity})\]
\[\mbox{NormExpectedCost} = c_{\mathrm{FP}}\cdot\mathrm{FPRate}\cdot\mathrm{N}+ c_{\mathrm{FN}}\cdot\mathrm{FNRate}\cdot\mathrm{P}\]
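A rough sketch of how the expected cost can change the ranking of two classifiers. False positives can only occur among true negatives and false negatives only among true positives, so the sketch weights FPRate by \(\mathrm{N}\) and FNRate by \(\mathrm{P}\). The function name, the cost values, and the sensitivity and specificity of the upsampled classifier (0.8 and 644/950, back-calculated from the outputs in these slides) are assumptions made for illustration.

```r
# Normalized expected cost per example (hypothetical helper)
norm.expected.cost <- function(sens, spec, p.pos, c.fp, c.fn) {
  p.neg   <- 1 - p.pos    # N: proportion of negatives
  fp.rate <- 1 - spec     # FPRate
  fn.rate <- 1 - sens     # FNRate
  c.fp * fp.rate * p.neg + c.fn * fn.rate * p.pos
}

p.pos <- 0.05   # 50 positives out of 1000, as in the unbalanced example

# Assumed costs: a false negative is 10 times as costly as a false positive
cost.upsamp <- norm.expected.cost(sens = 0.8, spec = 644 / 950, p.pos, c.fp = 1, c.fn = 10)
cost.always <- norm.expected.cost(sens = 0,   spec = 1,         p.pos, c.fp = 1, c.fn = 10)

print(c(upsampled = cost.upsamp, always.negative = cost.always))
```

Under these assumed costs the upsampled classifier is cheaper than "always predict -1"; with equal costs the ordering flips, which matches the accuracy comparison.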
https://en.wikipedia.org/wiki/Receiver_operating_characteristic
Think: "If I fix FPR at 0.4, what is my TPR?"
Obviously, higher is better. Random guessing gives an ROC curve along \(y = x\).
If the area under the curve (AUC) is 1, we have a perfect classifier. An AUC of 0.5 is no better than random guessing.
Very common measure of classifier performance, especially when classes are imbalanced.
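A minimal base-R sketch of how an ROC curve and its AUC can be computed from real-valued classifier scores by sweeping a decision threshold. The scores are simulated (an assumption for illustration); with an SVM one would use its decision values instead.

```r
set.seed(1)
# Simulated scores: positives tend to score higher than negatives
y     <- c(rep(1, 50), rep(-1, 950))
score <- c(rnorm(50, mean = 1), rnorm(950, mean = 0))

# Sweep the threshold from high to low; predict +1 when score >= threshold
thresholds <- c(Inf, sort(unique(score), decreasing = TRUE))
tpr <- sapply(thresholds, function(t) mean(score[y ==  1] >= t))
fpr <- sapply(thresholds, function(t) mean(score[y == -1] >= t))

# Area under the curve by the trapezoidal rule
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
print(auc)

plot(fpr, tpr, type = "s",
     xlab = "False positive rate", ylab = "True positive rate")
abline(0, 1, lty = 2)   # random guessing
```

The AUC computed this way equals the proportion of (positive, negative) pairs in which the positive example gets the higher score, which is one reason it is insensitive to class proportions.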
If we care about all these measures, why do we optimize misclassification rate, or margin, or likelihood?
Classifiers learned the way we described often perform well on the measures presented here.
\[ \mathrm{MSE} = n^{-1} \sum_{i=1}^n (\hat y_i - y_i)^2\]
\[ \mathrm{RMSE} = \sqrt{ n^{-1} \sum_{i=1}^n (\hat y_i - y_i)^2 }\]
\[ \mathrm{MAE} = n^{-1} \sum_{i=1}^n |\hat y_i - y_i|\]
I find MAE easier to interpret. (How far am I from the correct value, on average?) RMSE is at least in the same units as the \(y\).
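The three measures on a small made-up example (the values of `y` and `yhat` are hypothetical):

```r
y    <- c(3.0, 5.0, 2.5, 7.0)   # true values
yhat <- c(2.5, 5.0, 4.0, 8.0)   # predictions

err  <- yhat - y
mse  <- mean(err^2)       # squared units of y
rmse <- sqrt(mse)         # same units as y
mae  <- mean(abs(err))    # same units as y

print(c(MSE = mse, RMSE = rmse, MAE = mae))
```

RMSE is never smaller than MAE, and the gap grows when a few errors are much larger than the rest, because squaring gives those errors more weight.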
\[ \mathrm{MRE} = n^{-1} \sum_{i=1}^n \frac{|\hat y_i - y_i|}{|y_i|}\]
Scales error according to magnitude of true \(y\). E.g., if MRE=\(0.2\), then regression is wrong by 20% of the value of \(y\), on average.
If this is appropriate for your problem then linear regression, which assumes additive error, may not be appropriate. Options include using a different model or regression on \(\log y\) rather than on \(y\).
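A sketch of the \(\log y\) option on simulated data with multiplicative noise. The data-generating process and the helper `mre()` are assumptions made for illustration.

```r
# Mean relative error (hypothetical helper)
mre <- function(yhat, y) mean(abs(yhat - y) / abs(y))

set.seed(1)
# Simulated data: noise multiplies y instead of adding to it
x <- runif(200, 0, 3)
y <- exp(1 + x + rnorm(200, sd = 0.3))

fit.lin <- lm(y ~ x)        # additive error on y
fit.log <- lm(log(y) ~ x)   # additive error on log y, i.e. multiplicative on y

mre.lin <- mre(predict(fit.lin), y)
mre.log <- mre(exp(predict(fit.log)), y)   # back-transform to the scale of y

print(c(linear = mre.lin, log.scale = mre.log))
```

On data like this the log-scale fit should give the smaller MRE, since its errors are roughly proportional to \(y\) while the linear fit makes large relative errors where \(y\) is small.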
https://en.wikipedia.org/wiki/Approximation_error#Formal_Definition