Compute the proportion of examples that were correctly or incorrectly classified.
\[\mathrm{Accuracy} = n^{-1} \sum_{i=1}^n 1(\hat{y}_i = y_i)\]
\[\mathrm{ErrorRate} = n^{-1} \sum_{i=1}^n 1(\hat{y}_i \ne y_i)\]
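To make the definitions concrete, here is a minimal R sketch on made-up label vectors (the values are invented purely for illustration):

y.true <- c(1, 1, 1, -1, -1)   # hypothetical true labels
y.hat  <- c(1, 1, -1, -1, 1)   # hypothetical predicted labels
mean(y.hat == y.true)          # accuracy: 3/5 = 0.6
mean(y.hat != y.true)          # error rate: 2/5 = 0.4; the two always sum to 1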
Literature on learning from unbalanced classes:
http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=5299216
http://link.springer.com/article/10.1007%2Fs10115-013-0670-6
http://www.computer.org/csdl/proceedings/icdm/2012/4905/00/4905a695-abs.html
http://www.computer.org/csdl/proceedings/icdm/2011/4408/00/4408a754-abs.html
# Balanced classes: 500 positives and 500 negatives drawn from two Gaussians.
# Assumes the class means mupos and muneg were defined earlier (not shown here).
library(e1071)
library(ggplot2)
npos <- 500; nneg <- 500
set.seed(1)
df <- rbind(data.frame(x = rnorm(npos, mupos), y = 1),
            data.frame(x = rnorm(nneg, muneg), y = -1))
df$y <- as.factor(df$y)
sep <- tune(svm, y ~ x, data = df,
            ranges = list(gamma = 2^(-1:1), cost = 2^(2:4)))
df$ypred <- predict(sep$best.model)
ggplot(df, aes(x = x, fill = y)) +
  geom_histogram(alpha = 0.2, position = "identity", bins = 51) +
  geom_point(aes(y = ypred, colour = ypred)) +
  scale_color_discrete(drop = FALSE)
# Imbalanced classes: only 50 positives vs. 950 negatives.
npos <- 50; nneg <- 950
set.seed(1)
df <- rbind(data.frame(x = rnorm(npos, mupos), y = 1),
            data.frame(x = rnorm(nneg, muneg), y = -1))
df$y <- as.factor(df$y)
rsep <- tune(svm, y ~ x, data = df,
             ranges = list(gamma = 2^(-1:1), cost = 2^(2:4)))
df$ypred <- predict(rsep$best.model)
ggplot(df, aes(x = x, fill = y)) +
  geom_histogram(alpha = 0.2, position = "identity", bins = 51) +
  geom_point(aes(y = ypred, colour = ypred)) +
  scale_color_discrete(drop = FALSE)
# Up-sample the minority (positive) class by resampling 900 extra positives.
library(dplyr)
newpos <- df %>% filter(y == 1) %>% sample_n(900, replace = TRUE)
dfupsamp <- rbind(df, newpos)
upsep <- tune(svm, y ~ x, data = dfupsamp,
              ranges = list(gamma = 2^(-1:1), cost = 2^(2:4)))
dfupsamp$ypred <- predict(upsep$best.model)
ggplot(dfupsamp, aes(x = x, fill = y)) +
  geom_histogram(alpha = 0.2, position = "identity", bins = 51) +
  geom_point(aes(y = ypred, colour = ypred)) +
  scale_color_discrete(drop = FALSE)
# Evaluate both models on the original (imbalanced) data.
df$upsampred <- predict(upsep$best.model, df)
mean(df$y == df$ypred)
## [1] 0.95
mean(df$y == df$upsampred)
## [1] 0.694
No upsampling: 95% Accuracy
Upsampled: 69% Accuracy
So why do you like the upsampled classifier better?
| Total population | True class positive | True class negative |
|---|---|---|
| Predicted class positive | True positive | False positive (Type I error) |
| Predicted class negative | False negative (Type II error) | True negative |
Careful! A “false positive” is actually a negative and a “false negative” is actually a positive.
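As a quick sketch, the same 2×2 layout can be produced with base R's table() on the predictions computed above; the caret call below adds derived statistics:

# Base-R confusion matrix for the model fit on the imbalanced data.
table(Predicted = df$ypred, True = df$y)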
library(caret)
cm_orig <- confusionMatrix(df$ypred, df$y, positive = "1", mode = "prec_recall")
print(cm_orig$table)
##           Reference
## Prediction  -1   1
##         -1 950  50
##         1    0   0
cm_up <- confusionMatrix(df$upsampred, df$y, positive = "1", mode = "prec_recall")
print(cm_up$table)
##           Reference
## Prediction  -1   1
##         -1 654  10
##         1  296  40
\[\mathrm{Precision} = \frac{\sum \mathrm{True\ positive}}{\sum \mathrm{Predicted\ positive}}\]
\[\mathrm{Recall} = \frac{\sum \mathrm{True\ positive}}{\sum \mathrm{Class\ positive}}\]
\[\mbox{F-measure} = 2 \frac{\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}\] https://en.wikipedia.org/wiki/F1_score
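As a worked check, the F-measure can be computed directly from the counts in cm_up$table above (TP = 40, FP = 296, FN = 10, taking 1 as the positive class):

tp <- 40; fp <- 296; fn <- 10            # counts from cm_up$table
prec.cm   <- tp / (tp + fp)              # precision = TP / predicted positive
recall.cm <- tp / (tp + fn)              # recall = TP / class positive
2 * prec.cm * recall.cm / (prec.cm + recall.cm)   # about 0.207, matching F1.upsamp below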
For the “always predict -1” classifier, recall = 0 and precision is undefined (0/0, since nothing is predicted positive; conventionally taken as 0), so F-measure = 0.
For the classifier learned from up-sampled data,
prec <- sum(df$y == 1 & df$upsampred == 1) / sum(df$upsampred == 1)
recall <- sum(df$y == 1 & df$upsampred == 1) / sum(df$y == 1)
F1.upsamp <- 2 * prec * recall / (prec + recall)
print(F1.upsamp)
## [1] 0.2072539
NOTE that the F-measure is not “symmetric”; it depends on which class is defined as positive. It is typically used when the positive class is rare but important, e.g. in information retrieval.
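To see the asymmetry concretely, here is a sketch that treats -1 as the positive class instead, using the same counts from cm_up$table; the resulting F-measure is very different:

tp.n <- 654; fp.n <- 10; fn.n <- 296     # roles of the classes swapped
prec.n   <- tp.n / (tp.n + fp.n)
recall.n <- tp.n / (tp.n + fn.n)
2 * prec.n * recall.n / (prec.n + recall.n)   # about 0.81, vs. about 0.21 with 1 as positive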
\[\mathrm{Sensitivity} = \frac{\sum \mathrm{True\ positive}}{\sum \mathrm{Class\ positive}}\]
\[\mathrm{Specificity} = \frac{\sum \mathrm{True\ negative}}{\sum \mathrm{Class\ negative}}\]
\[\mbox{BalancedAccuracy} = \frac{1}{2} (\mathrm{Sensitivity}+\mathrm{Specificity})\]
https://en.wikipedia.org/wiki/Accuracy_and_precision#In_binary_classification
Note: Sensitivity is same as Recall.
For the “always predict -1” classifier, sensitivity = 0 and specificity = 1, so balanced accuracy = 0.5.
For the classifier learned from up-sampled data,
sens <- sum(df$y == 1 & df$upsampred == 1) / sum(df$y == 1)
spec <- sum(df$y == -1 & df$upsampred == -1) / sum(df$y == -1)
bal.acc.upsamp <- 0.5 * (sens + spec)
print(bal.acc.upsamp)
## [1] 0.7442105
https://en.wikipedia.org/wiki/Evaluation_of_binary_classifiers
| Total population | True class positive | True class negative |
|---|---|---|
| Predicted class positive | True positive | False positive (Type I error) |
| Predicted class negative | False negative (Type II error) | True negative |

Derived measures:

- Prevalence = Σ class positive / Σ total population
- Accuracy (ACC) = (Σ true positive + Σ true negative) / Σ total population
- Positive predictive value (PPV), Precision = Σ true positive / Σ predicted positive
- False discovery rate (FDR) = Σ false positive / Σ predicted positive
- False omission rate (FOR) = Σ false negative / Σ predicted negative
- Negative predictive value (NPV) = Σ true negative / Σ predicted negative
- True positive rate (TPR), Sensitivity, Recall = Σ true positive / Σ class positive
- False positive rate (FPR), Fall-out = Σ false positive / Σ class negative
- False negative rate (FNR), Miss rate = Σ false negative / Σ class positive
- True negative rate (TNR), Specificity (SPC) = Σ true negative / Σ class negative
- Positive likelihood ratio (LR+) = TPR / FPR
- Negative likelihood ratio (LR−) = FNR / TNR
- Diagnostic odds ratio (DOR) = LR+ / LR−
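As a sketch, several of these quantities computed from the up-sampled model's confusion matrix above (TP = 40, FP = 296, FN = 10, TN = 654):

TP <- 40; FP <- 296; FN <- 10; TN <- 654   # counts from cm_up$table
n <- TP + FP + FN + TN
prevalence <- (TP + FN) / n     # proportion of positives in the population
ppv <- TP / (TP + FP)           # positive predictive value (precision)
npv <- TN / (TN + FN)           # negative predictive value
tpr <- TP / (TP + FN)           # sensitivity / recall
fpr <- FP / (FP + TN)           # fall-out
c(prevalence = prevalence, PPV = ppv, NPV = npv,
  TPR = tpr, FPR = fpr, LRplus = tpr / fpr)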
Recall the definitions:

\[\mathrm{Sensitivity} = \frac{\sum \mathrm{True\ positive}}{\sum \mathrm{Class\ positive}}, \quad \mathrm{Specificity} = \frac{\sum \mathrm{True\ negative}}{\sum \mathrm{Class\ negative}}\]

\[\mbox{BalancedAccuracy} = \frac{1}{2} (\mathrm{Sensitivity}+\mathrm{Specificity})\]
What if e.g. false positives are more costly than false negatives?
Let \(\mathrm{P}\) and \(\mathrm{N}\) be the proportions of positives and negatives in the population.
\[\mathrm{FNRate} = (1 - \mathrm{Sensitivity}), \mathrm{FPRate} = (1 - \mathrm{Specificity})\]
\[\mbox{NormExpectedCost} = c_{\mathrm{FP}}\cdot\mathrm{FPRate}\cdot\mathrm{N} + c_{\mathrm{FN}}\cdot\mathrm{FNRate}\cdot\mathrm{P}\]
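A minimal sketch with made-up costs (the rates are those of the up-sampled model computed above; c.fp and c.fn are hypothetical):

c.fp <- 1; c.fn <- 5            # hypothetical: a false negative costs 5x a false positive
P <- 0.05; N <- 0.95            # class proportions in our imbalanced data
fn.rate <- 1 - sens             # 1 - sensitivity, from the balanced-accuracy code above
fp.rate <- 1 - spec             # 1 - specificity
c.fp * fp.rate * N + c.fn * fn.rate * P   # normalized expected cost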
https://en.wikipedia.org/wiki/Receiver_operating_characteristic
Think: “If I fix FPR at 0.4, what is my TPR?”
Curves closer to the top-left corner are better. Random guessing gives an ROC curve along \(y = x\).
If the area under the curve (AUC) is 1, we have a perfect classifier. An AUC of 0.5 corresponds to random guessing; an AUC below 0.5 means the classifier is worse than chance (flipping its predictions would do better).
Very common measure of classifier performance, especially when classes are imbalanced.
library(ROCR)
# ROC curve for the model trained on the imbalanced data.
preds <- attr(predict(rsep$best.model, df, decision.values = TRUE), "decision.values")
plot(performance(prediction(preds, df$y), "tpr", "fpr"))
abline(a = 0, b = 1)   # random-guessing reference line
performance(prediction(preds,df$y),"auc")@y.values
## [[1]]
## [1] 0.4750316
# ROC curve for the model trained on the up-sampled data, evaluated on df.
preds <- attr(predict(upsep$best.model, df, decision.values = TRUE), "decision.values")
plot(performance(prediction(preds, df$y), "tpr", "fpr"))
abline(a = 0, b = 1)
performance(prediction(preds,df$y),"auc")@y.values
## [[1]]
## [1] 0.7996842
If we care about all these measures, why do we optimize misclassification rate, or margin, or likelihood?
Classifiers learned the way we have described often perform well on the measures presented here.