To date, we have focussed on accuracy: How often is my classifier correct on new data?
Depending on how the classifier will be applied, however, other measures may be more appropriate.
Compute the proportion that were correctly or incorrectly classified.
\[\mathrm{Accuracy} = n^{-1} \sum_{i=1}^n 1(\hat{y}_i = y_i)\]
\[\mathrm{Error\ Rate} = n^{-1} \sum_{i=1}^n 1(\hat{y}_i \ne y_i)\]
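As a minimal sketch, both quantities are one-liners in R. The vectors `y` and `yhat` below are made-up illustrations, not data from these slides.

```r
# Hypothetical true labels and predictions (illustrative only)
y    <- factor(c( 1, -1, -1,  1, -1, -1, -1,  1))
yhat <- factor(c( 1, -1,  1, -1, -1, -1, -1,  1))

accuracy   <- mean(yhat == y)   # proportion classified correctly
error_rate <- mean(yhat != y)   # proportion classified incorrectly

print(c(accuracy = accuracy, error_rate = error_rate))
```

The two numbers always sum to 1, so reporting either one is enough.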
Literature on learning from unbalanced classes:
http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=5299216
http://link.springer.com/article/10.1007%2Fs10115-013-0670-6
http://www.computer.org/csdl/proceedings/icdm/2012/4905/00/4905a695-abs.html
http://www.computer.org/csdl/proceedings/icdm/2011/4408/00/4408a754-abs.html
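library(e1071); library(ggplot2); library(dplyr)  # svm/tune, ggplot, filter/sample_n
# mupos and muneg (the class means) are assumed to be defined earlier
npos <- 500; nneg <- 500; set.seed(1)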
df <- rbind(data.frame(x=rnorm(npos,mupos), y=1),data.frame(x=rnorm(nneg,muneg),y=-1)); df$y <- as.factor(df$y)
sep <- tune(svm,y~x,data=df,ranges=list(gamma = 2^(-1:1), cost = 2^(2:4)))
df$ypred <- predict(sep$best.model)
ggplot(df,aes(x=x,fill=y)) + geom_histogram(alpha=0.2,position="identity",bins=51) + geom_point(aes(y=ypred,colour=ypred)) + scale_color_discrete(drop=FALSE)
npos <- 50; nneg <- 950; set.seed(1)
df <- rbind(data.frame(x=rnorm(npos,mupos), y=1),data.frame(x=rnorm(nneg,muneg),y=-1)); df$y <- as.factor(df$y)
rsep <- tune(svm,y~x,data=df,ranges=list(gamma = 2^(-1:1), cost = 2^(2:4)))
df$ypred <- predict(rsep$best.model);
ggplot(df,aes(x=x,fill=y)) + geom_histogram(alpha=0.2,position="identity",bins=51) + geom_point(aes(y=ypred,colour=ypred)) + scale_color_discrete(drop=FALSE)
# Resample the 50 positives with replacement so that both classes have 950 rows
newpos <- df %>% filter(y == 1) %>% sample_n(900,replace=TRUE); dfupsamp <- rbind(df,newpos)
upsep <- tune(svm,y~x,data=dfupsamp,ranges=list(gamma = 2^(-1:1), cost = 2^(2:4)))
dfupsamp$ypred <- predict(upsep$best.model);
ggplot(dfupsamp,aes(x=x,fill=y)) + geom_histogram(alpha=0.2,position="identity",bins=51) + geom_point(aes(y=ypred,colour=ypred)) + scale_color_discrete(drop=FALSE)
df$upsampred <- predict(upsep$best.model,df)
mean(df$y == df$ypred)
## [1] 0.95
mean(df$y == df$upsampred)
## [1] 0.694
No upsampling: 95% Accuracy
Upsampled: 69% Accuracy
So why do you like the upsampled classifier better?
| Total population | True class positive | True class negative |
|---|---|---|
| Predicted class positive | True positive | False positive (Type I error) |
| Predicted class negative | False negative (Type II error) | True negative |
Careful! A “false positive” is actually a negative and a “false negative” is actually a positive.
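A confusion matrix like this can be tabulated with base R's `table()`. The sketch below is self-contained. The counts are an assumption, chosen to be consistent with the 50/950 class split and the 0.694 accuracy reported above for the upsampled classifier.

```r
# Assumed counts (illustrative): 50 true positives-class rows, 950 negatives
y    <- factor(c(rep(1, 50), rep(-1, 950)), levels = c(1, -1))
pred <- factor(c(rep(1, 40), rep(-1, 10),      # predictions for the 50 positives
                 rep(1, 296), rep(-1, 654)),   # predictions for the 950 negatives
               levels = c(1, -1))

# Rows are predicted classes, columns are true classes
cm <- table(Predicted = pred, True = y)
print(cm)

tp <- cm["1", "1"];  fp <- cm["1", "-1"]   # predicted positive
fn <- cm["-1", "1"]; tn <- cm["-1", "-1"]  # predicted negative
acc <- (tp + tn) / sum(cm)
```

Indexing the table by level name rather than by position avoids mixing up the cells when the factor levels are ordered differently.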
\[\mathrm{Precision} = \frac{\sum \mbox{True positive}}{\sum \mbox{Predicted positive}}\]
\[\mathrm{Recall} = \frac{\sum \mbox{True positive}}{\sum \mbox{Class positive}}\]
\[\mbox{F-measure} = 2 \frac{\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}\] https://en.wikipedia.org/wiki/F1_score
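For the “always predict -1” classifier, recall = 0 and precision = 0/0, since nothing is predicted positive. Taking precision to be 0 by convention, F-measure = 0. Equivalently, F-measure = 2TP / (2TP + FP + FN) = 0 because there are no true positives.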
For the classifier learned from up-sampled data,
prec <- sum(df$y == 1 & df$upsampred == 1) / sum(df$upsampred == 1)
recall <- sum(df$y == 1 & df$upsampred == 1) / sum(df$y == 1)
F1.upsamp <- 2 * prec*recall / (prec + recall)
print(F1.upsamp)
## [1] 0.2072539
NOTE that the F-measure is not “symmetric”: it depends on which class is defined as positive. It is typically used when the positive class is rare but important, e.g. in information retrieval.
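To see the asymmetry concretely, the sketch below computes the F-measure twice from one confusion matrix, swapping which class counts as positive. The counts (TP = 40, FP = 296, FN = 10, TN = 654) are inferred from the outputs reported in these slides and should be read as an assumption. The helper `f1_from_counts` is a hypothetical name.

```r
# F-measure from counts; equivalent to 2 * prec * recall / (prec + recall)
f1_from_counts <- function(tp, fp, fn) {
  prec   <- tp / (tp + fp)
  recall <- tp / (tp + fn)
  2 * prec * recall / (prec + recall)
}

# Assumed counts with class 1 as the positive class
tp <- 40; fp <- 296; fn <- 10; tn <- 654

f1_pos <- f1_from_counts(tp, fp, fn)   # positive class = 1
# With -1 as the positive class, TN plays the role of TP,
# and the roles of FP and FN are exchanged
f1_neg <- f1_from_counts(tn, fn, fp)

print(c(f1_pos = f1_pos, f1_neg = f1_neg))
```

Under these counts the same predictions score about 0.21 when the rare class is positive and about 0.81 when the common class is positive.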
\[\mathrm{Sensitivity} = \frac{\sum \mbox{True positive}}{\sum \mbox{Class positive}}\]
\[\mathrm{Specificity} = \frac{\sum \mbox{True negative}}{\sum \mbox{Class negative}}\]
\[\mbox{BalancedAccuracy} = \frac{1}{2} (\mathrm{Sensitivity}+\mathrm{Specificity})\]
https://en.wikipedia.org/wiki/Accuracy_and_precision#In_binary_classification
Note: Sensitivity is the same as Recall.
For the “always predict -1” classifier, sensitivity = 0, specificity = 1, balanced accuracy = 0.5.
For the classifier learned from up-sampled data,
sens <- sum(df$y == 1 & df$upsampred == 1) / sum(df$y == 1)
spec <- sum(df$y == -1 & df$upsampred == -1) / sum(df$y == -1)
bal.acc.upsamp <- 0.5*(sens + spec)
print(bal.acc.upsamp)
## [1] 0.7442105
https://en.wikipedia.org/wiki/Evaluation_of_binary_classifiers
| Total population | True class positive | True class negative | | |
|---|---|---|---|---|
| | | | Prevalence = Σ Class positive / Σ Total population | |
| Predicted class positive | True positive | False positive (Type I error) | Positive predictive value (PPV), Precision = Σ True positive / Σ Predicted positive | False discovery rate (FDR) = Σ False positive / Σ Predicted positive |
| Predicted class negative | False negative (Type II error) | True negative | False omission rate (FOR) = Σ False negative / Σ Predicted negative | Negative predictive value (NPV) = Σ True negative / Σ Predicted negative |
| Accuracy (ACC) = (Σ True positive + Σ True negative) / Σ Total population | True positive rate (TPR), Sensitivity, Recall = Σ True positive / Σ Class positive | False positive rate (FPR), Fall-out = Σ False positive / Σ Class negative | Positive likelihood ratio (LR+) = TPR / FPR | Diagnostic odds ratio (DOR) = LR+ / LR− |
| | False negative rate (FNR), Miss rate = Σ False negative / Σ Class positive | True negative rate (TNR), Specificity (SPC) = Σ True negative / Σ Class negative | Negative likelihood ratio (LR−) = FNR / TNR | |
As a reminder:
\[\mathrm{Sensitivity} = \frac{\sum \mbox{True positive}}{\sum \mbox{Class positive}}\]
\[\mathrm{Specificity} = \frac{\sum \mbox{True negative}}{\sum \mbox{Class negative}}\]
\[\mbox{BalancedAccuracy} = \frac{1}{2} (\mathrm{Sensitivity}+\mathrm{Specificity})\]
What if e.g. false positives are more costly than false negatives?
Let \(\mathrm{P}\) and \(\mathrm{N}\) be the proportions of positives and negatives in the population.
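\[\mathrm{FNRate} = 1 - \mathrm{Sensitivity}, \quad \mathrm{FPRate} = 1 - \mathrm{Specificity}\]
False positives occur only among negatives and false negatives only among positives, so each rate is weighted by the proportion of the class it is computed on:
\[\mbox{NormExpectedCost} = c_{\mathrm{FP}}\cdot\mathrm{FPRate}\cdot\mathrm{N}+ c_{\mathrm{FN}}\cdot\mathrm{FNRate}\cdot\mathrm{P}\]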
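A minimal sketch of a normalized expected cost in R, weighting the false positive rate by the proportion of negatives and the false negative rate by the proportion of positives. The function name `norm_expected_cost` is hypothetical and the costs (a false negative ten times as costly as a false positive) are illustrative. The sensitivity of 0.8 and specificity of 654/950 are assumed values, consistent with the balanced accuracy of about 0.744 reported earlier.

```r
# Normalized expected cost of a classifier (all inputs are illustrative)
norm_expected_cost <- function(sens, spec, p_pos, c_fp, c_fn) {
  p_neg   <- 1 - p_pos
  fn_rate <- 1 - sens   # misses among the true positives
  fp_rate <- 1 - spec   # false alarms among the true negatives
  c_fp * fp_rate * p_neg + c_fn * fn_rate * p_pos
}

# "Always predict -1": sensitivity 0, specificity 1
cost_always_neg <- norm_expected_cost(sens = 0, spec = 1,
                                      p_pos = 0.05, c_fp = 1, c_fn = 10)

# Assumed operating point for the classifier learned from up-sampled data
cost_upsamp <- norm_expected_cost(sens = 0.8, spec = 654 / 950,
                                  p_pos = 0.05, c_fp = 1, c_fn = 10)

print(c(always_neg = cost_always_neg, upsampled = cost_upsamp))
```

With these assumed costs the up-sampled classifier is cheaper despite its lower accuracy. With equal costs the ranking reverses, which is why the costs have to come from the application.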
https://en.wikipedia.org/wiki/Receiver_operating_characteristic
Think: “If I fix FPR at 0.4, what is my TPR?”
Obviously, higher is better. Random guessing gives an ROC curve along \(y = x\).
If the area under the curve (AUC) is 1, we have a perfect classifier. An AUC of 0.5 is no better than random guessing.
Very common measure of classifier performance, especially when classes are imbalanced.
library(ROCR)
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
preds <- attr(predict(upsep$best.model,df,decision.values = TRUE),"decision.values")
plot(performance(prediction(preds,df$y),"tpr","fpr"))
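The AUC can also be computed without plotting. It equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, which is the Mann-Whitney rank statistic. The sketch below uses base R only, not ROCR. The function name `auc_rank` and the simulated scores are assumptions made for illustration.

```r
# AUC via the rank-sum (Mann-Whitney) identity
auc_rank <- function(scores, labels, positive) {
  is_pos <- labels == positive
  n_pos  <- sum(is_pos)
  n_neg  <- sum(!is_pos)
  r <- rank(scores)   # ties receive average ranks
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Sanity checks on tiny hand-made examples
auc_perfect  <- auc_rank(c(0.9, 0.8, 0.2, 0.1), c(1, 1, -1, -1), positive = 1)
auc_reversed <- auc_rank(c(0.1, 0.2, 0.8, 0.9), c(1, 1, -1, -1), positive = 1)

# Hypothetical imbalanced data: positives score higher on average
set.seed(1)
labels  <- c(rep(1, 50), rep(-1, 950))
scores  <- c(rnorm(50, mean = 1), rnorm(950, mean = 0))
auc_sim <- auc_rank(scores, labels, positive = 1)
print(auc_sim)
```

Because it depends only on how the scores are ordered, the AUC does not change under any monotone increasing rescaling of the decision values.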
If we care about all these measures, why do we optimize misclassification rate, or margin, or likelihood?
Classifiers learned the way we described often perform well on the measures presented here.
\[ \mathrm{MSE} = n^{-1} \sum_{i=1}^n (\hat y_i - y_i)^2\]
\[ \mathrm{RMSE} = \sqrt{ n^{-1} \sum_{i=1}^n (\hat y_i - y_i)^2 }\]
\[ \mathrm{MAE} = n^{-1} \sum_{i=1}^n |\hat y_i - y_i|\]
I find MAE easier to interpret. (How far am I from the correct value, on average?) RMSE is at least in the same units as the \(y\).
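A minimal sketch of the three regression measures in R, using made-up values for `y` and `yhat`:

```r
# Hypothetical true values and predictions (illustrative only)
y    <- c(3.0, 5.0, 2.5, 7.0)
yhat <- c(2.5, 5.0, 4.0, 8.0)

err  <- yhat - y
mse  <- mean(err^2)       # squared units of y
rmse <- sqrt(mse)         # same units as y
mae  <- mean(abs(err))    # same units as y

print(c(MSE = mse, RMSE = rmse, MAE = mae))
```

RMSE is never smaller than MAE. The gap widens when a few errors are much larger than the rest, because squaring gives those errors more weight.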
\[ \mathrm{MRE} = n^{-1} \sum_{i=1}^n \frac{|\hat y_i - y_i|}{|y_i|}\]
Scales error according to magnitude of true \(y\). E.g., if MRE=\(0.2\), then regression is wrong by 20% of the value of \(y\), on average.
If this is appropriate for your problem then linear regression, which assumes additive error, may not be appropriate. Options include using a different model or regression on \(\log y\) rather than on \(y\).
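The sketch below computes MRE on made-up values, then fits a regression on \(\log y\) to simulated data with multiplicative noise. The data-generating model (intercept 1, slope 2, noise standard deviation 0.2 on the log scale) is an assumption chosen for illustration.

```r
# MRE on hypothetical values: relative errors are 0.2, 0.2 and 0.1
y    <- c(10, 100, 1000)
yhat <- c(12,  80, 1100)
mre  <- mean(abs(yhat - y) / abs(y))
mae  <- mean(abs(yhat - y))   # dominated by the largest y
print(c(MRE = mre, MAE = mae))

# Simulated data with multiplicative error: y = exp(1 + 2x) * noise
set.seed(1)
x     <- runif(200, 0, 2)
y_sim <- exp(1 + 2 * x + rnorm(200, sd = 0.2))

# Regressing log(y) on x turns the multiplicative error into additive error
fit_log   <- lm(log(y_sim) ~ x)
slope_hat <- unname(coef(fit_log)["x"])
print(coef(fit_log))
```

Predictions from `fit_log` are on the log scale, so they need `exp()` before being compared with \(y\) using MRE.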
https://en.wikipedia.org/wiki/Approximation_error#Formal_Definition