Review: Error Rate / Accuracy

Compute the proportion that were correctly or incorrectly classified.

\[\mathrm{Accuracy} = n^{-1} \sum_{i=1}^n 1(\hat{y}_i = y_i)\]

\[\mbox{Error Rate} = n^{-1} \sum_{i=1}^n 1(\hat{y}_i \ne y_i)\]
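As a quick sanity check, both quantities are one line of R each; a minimal sketch with made-up label vectors (y and yhat here are hypothetical, not the data used later):

y    <- c(1, 1, -1, -1, 1, -1)     # hypothetical true labels
yhat <- c(1, -1, -1, -1, 1, 1)     # hypothetical predictions
mean(yhat == y)                    # accuracy: proportion correctly classified
mean(yhat != y)                    # error rate: the two always sum to 1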

Imbalanced classes


Literature on learning from unbalanced classes:

http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=5299216
http://link.springer.com/article/10.1007%2Fs10115-013-0670-6
http://www.computer.org/csdl/proceedings/icdm/2012/4905/00/4905a695-abs.html
http://www.computer.org/csdl/proceedings/icdm/2011/4408/00/4408a754-abs.html

Example: 50% Positive, 50% Negative

library(e1071); library(ggplot2)   # svm()/tune() and plotting
# mupos, muneg (the class means) are assumed defined earlier, e.g. mupos <- 1; muneg <- -1
npos <- 500; nneg <- 500; set.seed(1)
df <- rbind(data.frame(x = rnorm(npos, mupos), y = 1), data.frame(x = rnorm(nneg, muneg), y = -1))
df$y <- as.factor(df$y)
sep <- tune(svm, y ~ x, data = df, ranges = list(gamma = 2^(-1:1), cost = 2^(2:4)))
df$ypred <- predict(sep$best.model)
ggplot(df, aes(x = x, fill = y)) + geom_histogram(alpha = 0.2, position = "identity", bins = 51) +
  geom_point(aes(y = ypred, colour = ypred)) + scale_color_discrete(drop = FALSE)

Example: 5% Positive, 95% Negative

npos <- 50; nneg <- 950; set.seed(1)   # now only 5% of the observations are positive
df <- rbind(data.frame(x = rnorm(npos, mupos), y = 1), data.frame(x = rnorm(nneg, muneg), y = -1))
df$y <- as.factor(df$y)
rsep <- tune(svm, y ~ x, data = df, ranges = list(gamma = 2^(-1:1), cost = 2^(2:4)))
df$ypred <- predict(rsep$best.model)
ggplot(df, aes(x = x, fill = y)) + geom_histogram(alpha = 0.2, position = "identity", bins = 51) +
  geom_point(aes(y = ypred, colour = ypred)) + scale_color_discrete(drop = FALSE)

Example: Upsampling

library(dplyr)   # filter() and sample_n()
# Up-sample the minority (positive) class: resample the 50 positives with replacement
newpos <- df %>% filter(y == 1) %>% sample_n(900, replace = TRUE)
dfupsamp <- rbind(df, newpos)
upsep <- tune(svm, y ~ x, data = dfupsamp, ranges = list(gamma = 2^(-1:1), cost = 2^(2:4)))
dfupsamp$ypred <- predict(upsep$best.model)
ggplot(dfupsamp, aes(x = x, fill = y)) + geom_histogram(alpha = 0.2, position = "identity", bins = 51) +
  geom_point(aes(y = ypred, colour = ypred)) + scale_color_discrete(drop = FALSE)

Upsampling: Accuracy

df$upsampred <- predict(upsep$best.model, df)   # up-sampled model, applied to the original data
mean(df$y == df$ypred)        # accuracy of the model trained on the original (imbalanced) data
## [1] 0.95
mean(df$y == df$upsampred)    # accuracy of the model trained on the up-sampled data
## [1] 0.694

No upsampling: 95% Accuracy

Upsampled: 69% Accuracy

So why might you prefer the up-sampled classifier?
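One hint (this check is not in the slides, but it matches the confusion matrix shown below): the model trained on the original data never predicts the positive class, so its 95% accuracy is just the base rate of the negative class.

table(df$ypred)      # all 1000 predictions are -1 (see the confusion matrix below)
mean(df$y == -1)     # the negative class makes up 95% of the data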

Definitions:
True/False Positives/Negatives


                           True class positive                True class negative
Predicted class positive   True positive                      False positive (Type I error)
Predicted class negative   False negative (Type II error)     True negative


Careful! A “false positive” is actually a negative and a “false negative” is actually a positive.
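As a quick sketch (not in the original slides), the four counts can be tabulated with base R before turning to caret; here for the up-sampled classifier's predictions on the original data:

tab <- table(Predicted = df$upsampred, True = df$y)
TP <- tab["1", "1"];  FP <- tab["1", "-1"]     # predicted-positive row
FN <- tab["-1", "1"]; TN <- tab["-1", "-1"]    # predicted-negative row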

Confusion Matrix

library(caret)
cm_orig <- confusionMatrix(df$ypred, df$y, positive = "1", mode = "prec_recall")
print(cm_orig$table)
##           Reference
## Prediction  -1   1
##         -1 950  50
##         1    0   0
cm_up <- confusionMatrix(df$upsampred, df$y, positive = "1", mode = "prec_recall")
print(cm_up$table)
##           Reference
## Prediction  -1   1
##         -1 654  10
##         1  296  40

Precision and Recall, F-measure

\[\mathrm{Precision} = \frac{\sum \mbox{True positive}}{\sum \mbox{Predicted positive}}, \qquad \mathrm{Recall} = \frac{\sum \mbox{True positive}}{\sum \mbox{Class positive}}\]


\[\mbox{F-measure} = 2 \frac{\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}\] https://en.wikipedia.org/wiki/F1_score

F-measure Example

\[\mbox{F-measure} = 2 \frac{\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}\] https://en.wikipedia.org/wiki/F1_score

For the “always predict -1” classifier, recall = 0 and precision = 0/0 is undefined (no positives are predicted), so the F-measure is conventionally taken to be 0.

For the classifier learned from up-sampled data,

prec   <- sum(df$y == 1 & df$upsampred == 1) / sum(df$upsampred == 1)   # TP / predicted positive
recall <- sum(df$y == 1 & df$upsampred == 1) / sum(df$y == 1)           # TP / class positive
F1.upsamp <- 2 * prec * recall / (prec + recall)
print(F1.upsamp)
## [1] 0.2072539

NOTE that the F-measure is not “symmetric”: it depends on which class is defined as positive. It is typically used when the positive class is rare but important, e.g. in information retrieval.
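To see the asymmetry concretely, here is a sketch (not in the original slides) that recomputes the F-measure with -1 treated as the positive class; because -1 is the majority class, the value is far higher (roughly 0.81 for the confusion matrix above) than F1.upsamp.

prec.neg   <- sum(df$y == -1 & df$upsampred == -1) / sum(df$upsampred == -1)
recall.neg <- sum(df$y == -1 & df$upsampred == -1) / sum(df$y == -1)
2 * prec.neg * recall.neg / (prec.neg + recall.neg)   # F-measure with -1 as "positive"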

Sensitivity and Specificity,
Balanced Accuracy


\[\mathrm{Sensitivity} = \frac{\sum \mbox{True positive}}{\sum \mbox{Class positive}}, \qquad \mathrm{Specificity} = \frac{\sum \mbox{True negative}}{\sum \mbox{Class negative}}\]


\[\mbox{BalancedAccuracy} = \frac{1}{2} (\mathrm{Sensitivity}+\mathrm{Specificity})\]

https://en.wikipedia.org/wiki/Accuracy_and_precision#In_binary_classification
Note: Sensitivity is the same as Recall.

Balanced accuracy Example

\[\mbox{BalancedAccuracy} = \frac{1}{2} (\mathrm{Sensitivity}+\mathrm{Specificity})\]

For “always predict -1” classifier, sensitivity = 0, specificity = 1, balanced accuracy = 0.5.

For the classifier learned from up-sampled data,

sens <- sum(df$y == 1 & df$upsampred == 1) / sum(df$y == 1)       # sensitivity (true positive rate)
spec <- sum(df$y == -1 & df$upsampred == -1) / sum(df$y == -1)    # specificity (true negative rate)
bal.acc.upsamp <- 0.5 * (sens + spec)
print(bal.acc.upsamp)
## [1] 0.7442105
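caret reports the same quantities in the $byClass component of a confusionMatrix object (the exact set of statistics depends on the caret version), which gives a quick cross-check:

cm_up$byClass[c("Sensitivity", "Specificity", "Balanced Accuracy")]   # should match sens, spec, bal.acc.upsamp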

Many Measures

https://en.wikipedia.org/wiki/Evaluation_of_binary_classifiers

Prevalence = Σ Class positive / Σ Total population
Accuracy (ACC) = (Σ True positive + Σ True negative) / Σ Total population
Positive predictive value (PPV), Precision = Σ True positive / Σ Predicted positive
False discovery rate (FDR) = Σ False positive / Σ Predicted positive
Negative predictive value (NPV) = Σ True negative / Σ Predicted negative
False omission rate (FOR) = Σ False negative / Σ Predicted negative
True positive rate (TPR), Sensitivity, Recall = Σ True positive / Σ Class positive
False positive rate (FPR), Fall-out = Σ False positive / Σ Class negative
False negative rate (FNR), Miss rate = Σ False negative / Σ Class positive
True negative rate (TNR), Specificity (SPC) = Σ True negative / Σ Class negative
Positive likelihood ratio (LR+) = TPR / FPR
Negative likelihood ratio (LR−) = FNR / TNR
Diagnostic odds ratio (DOR) = LR+ / LR−
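Most of these follow in a few lines of R from the confusion matrix; a sketch using the up-sampled classifier's counts from cm_up$table above (TP = 40, FP = 296, FN = 10, TN = 654):

TP <- 40; FP <- 296; FN <- 10; TN <- 654        # from cm_up$table above
PPV <- TP / (TP + FP); FDR <- FP / (TP + FP)    # precision, false discovery rate
NPV <- TN / (TN + FN); FOR <- FN / (TN + FN)    # negative predictive value, false omission rate
TPR <- TP / (TP + FN); FPR <- FP / (FP + TN)    # sensitivity, fall-out
TNR <- TN / (TN + FP); FNR <- FN / (TP + FN)    # specificity, miss rate
LRpos <- TPR / FPR; LRneg <- FNR / TNR; DOR <- LRpos / LRneg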

Cost sensitivity

\[\mathrm{Sensitivity} = \frac{\sum \mbox{True positive}}{\sum \mbox{Class positive}}, \qquad \mathrm{Specificity} = \frac{\sum \mbox{True negative}}{\sum \mbox{Class negative}}\]

Recall that balanced accuracy weights the two classes equally:

\[\mbox{BalancedAccuracy} = \frac{1}{2} (\mathrm{Sensitivity}+\mathrm{Specificity})\]

What if e.g. false positives are more costly than false negatives?

Let \(\mathrm{P}\) and \(\mathrm{N}\) be the proportions of positives and negatives in the population.

\[\mathrm{FNRate} = (1 - \mathrm{Sensitivity}), \mathrm{FPRate} = (1 - \mathrm{Specificity})\]

\[\mbox{NormExpectedCost} = c_{\mathrm{FP}}\cdot\mathrm{FPRate}\cdot\mathrm{N} + c_{\mathrm{FN}}\cdot\mathrm{FNRate}\cdot\mathrm{P}\]

http://www.csi.uottawa.ca/~cdrummon/pubs/pakdd08.pdf
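As an illustration (the cost values here are made up, not from the slides), the normalized expected cost of the up-sampled classifier when a false negative is five times as costly as a false positive:

P <- mean(df$y == 1); N <- mean(df$y == -1)   # class proportions: 0.05 and 0.95
FNRate <- 1 - sens; FPRate <- 1 - spec        # sens, spec from the balanced-accuracy slide
cFP <- 1; cFN <- 5                            # hypothetical costs
cFP * FPRate * N + cFN * FNRate * P           # normalized expected cost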

Receiver operating characteristic (ROC)


https://en.wikipedia.org/wiki/Receiver_operating_characteristic

Reading an ROC Curve

Think: “If I fix FPR at 0.4, what is my TPR?”

Obviously, higher is better. Random guessing gives an ROC curve along \(y = x\).

If the area under the curve (AUC) is 1, we have a perfect classifier. An AUC of 0.5 corresponds to random guessing and is effectively the worst case, since a classifier with AUC below 0.5 can simply be inverted.

Very common measure of classifier performance, especially when classes are imbalanced.
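The curve itself is traced out by sweeping a decision threshold over the classifier's scores and recording (FPR, TPR) at each cut-off. A minimal sketch of that mechanism with made-up scores and labels (the ROCR package does this for us below):

score <- c(0.9, 0.8, 0.7, 0.55, 0.4, 0.3, 0.2, 0.1)   # hypothetical classifier scores
truth <- c(  1,   1,  -1,    1,  -1,  -1,   1,  -1)   # hypothetical true labels
thr <- sort(unique(score), decreasing = TRUE)
tpr <- sapply(thr, function(t) mean(score[truth ==  1] >= t))   # sensitivity at each threshold
fpr <- sapply(thr, function(t) mean(score[truth == -1] >= t))   # 1 - specificity at each threshold
plot(c(0, fpr, 1), c(0, tpr, 1), type = "b", xlab = "FPR", ylab = "TPR")
abline(a = 0, b = 1)   # random-guessing reference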

ROC example - Original: AUC = 0.475

library(ROCR)
# Real-valued SVM decision values for the model trained on the original (imbalanced) data
preds <- attr(predict(rsep$best.model, df, decision.values = TRUE), "decision.values")
plot(performance(prediction(preds, df$y), "tpr", "fpr"))
abline(a = 0, b = 1)   # random-guessing reference line

performance(prediction(preds,df$y),"auc")@y.values
## [[1]]
## [1] 0.4750316

ROC example - Upsampled: AUC = 0.799

library(ROCR)
# Decision values for the model trained on the up-sampled data, evaluated on the original data
preds <- attr(predict(upsep$best.model, df, decision.values = TRUE), "decision.values")
plot(performance(prediction(preds, df$y), "tpr", "fpr"))
abline(a = 0, b = 1)   # random-guessing reference line

performance(prediction(preds,df$y),"auc")@y.values
## [[1]]
## [1] 0.7996842

AUROC, \(c\)-statistic

The area under the ROC curve is also called the AUROC or the concordance (\(c\)-) statistic: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative example.

Big picture: Optimizing classifiers

If we care about all these measures, why do we optimize misclassification rate, or margin, or likelihood?