## Review: Error Rate / Accuracy

Compute the proportion of instances that were classified correctly or incorrectly.

$\mathrm{Accuracy} = n^{-1} \sum_{i=1}^n 1(\hat{y}_i = y_i)$

$\mathrm{Error\ Rate} = n^{-1} \sum_{i=1}^n 1(\hat{y}_i \ne y_i)$
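
As a minimal sketch in R (the label vectors below are made up purely for illustration):

```r
# Hypothetical true labels and predictions, coded as -1/1
y    <- c(1, 1, -1, -1, -1, 1, -1, -1)
yhat <- c(1, -1, -1, -1, 1, 1, -1, -1)
mean(yhat == y)   # accuracy: proportion classified correctly
mean(yhat != y)   # error rate: proportion classified incorrectly
```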

## Imbalanced classes

• Suppose in the true population, 95% are negative.
• A classifier that always outputs negative is 95% accurate. This is the baseline accuracy.
• Plain accuracy is therefore not a useful measure here.
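
A one-line check of the 95% baseline figure, with made-up labels matching these proportions:

```r
y <- c(rep(-1, 950), rep(1, 50))   # 95% negative, 5% positive
mean(rep(-1, length(y)) == y)      # "always predict negative" baseline accuracy: 0.95
```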

There is a substantial literature on learning from unbalanced classes; two common remedies, up-sampling and cost-sensitive evaluation, appear later in these slides.

## Example: 50% Positive, 50% Negative

```r
library(e1071); library(ggplot2)   # svm()/tune() and plotting
# mupos and muneg (the two class means) are assumed defined earlier in the slides
npos <- 500; nneg <- 500; set.seed(1)
df <- rbind(data.frame(x = rnorm(npos, mupos), y = 1),
            data.frame(x = rnorm(nneg, muneg), y = -1)); df$y <- as.factor(df$y)
sep <- tune(svm, y ~ x, data = df, ranges = list(gamma = 2^(-1:1), cost = 2^(2:4)))
df$ypred <- predict(sep$best.model)
ggplot(df, aes(x = x, fill = y)) + geom_histogram(alpha = 0.2, position = "identity", bins = 51) +
  geom_point(aes(y = ypred, colour = ypred)) + scale_color_discrete(drop = FALSE)
```

## Example: 5% Positive, 95% Negative

```r
# Same setup, but now only 5% of the observations are positive
npos <- 50; nneg <- 950; set.seed(1)
df <- rbind(data.frame(x = rnorm(npos, mupos), y = 1),
            data.frame(x = rnorm(nneg, muneg), y = -1)); df$y <- as.factor(df$y)
rsep <- tune(svm, y ~ x, data = df, ranges = list(gamma = 2^(-1:1), cost = 2^(2:4)))
df$ypred <- predict(rsep$best.model)
ggplot(df, aes(x = x, fill = y)) + geom_histogram(alpha = 0.2, position = "identity", bins = 51) +
  geom_point(aes(y = ypred, colour = ypred)) + scale_color_discrete(drop = FALSE)
```

## Example: Upsampling

```r
library(dplyr)   # filter(), sample_n()
# Up-sample the rare positive class (y == 1) so the two classes are roughly balanced
newpos <- df %>% filter(y == 1) %>% sample_n(900, replace = TRUE)
dfupsamp <- rbind(df, newpos)
upsep <- tune(svm, y ~ x, data = dfupsamp, ranges = list(gamma = 2^(-1:1), cost = 2^(2:4)))
dfupsamp$ypred <- predict(upsep$best.model)
ggplot(dfupsamp, aes(x = x, fill = y)) + geom_histogram(alpha = 0.2, position = "identity", bins = 51) +
  geom_point(aes(y = ypred, colour = ypred)) + scale_color_discrete(drop = FALSE)
```

## Upsampling: Accuracy

```r
df$upsampred <- predict(upsep$best.model, df)
mean(df$y == df$ypred)      # classifier trained on the original, unbalanced data
## [1] 0.95
mean(df$y == df$upsampred)  # classifier trained on the up-sampled data
## [1] 0.694
```

No upsampling: 95% accuracy.

Upsampled: 69% accuracy.

So why might we prefer the classifier trained on the up-sampled data?

## Definitions: True/False Positives/Negatives

|                          | True class: positive           | True class: negative          |
|--------------------------|--------------------------------|-------------------------------|
| Predicted class positive | True positive                  | False positive (Type I error) |
| Predicted class negative | False negative (Type II error) | True negative                 |

Careful! A “false positive” is actually a negative and a “false negative” is actually a positive.
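
These four cells can also be counted directly in base R; a minimal sketch using `df` and `df$ypred` from the 5%/95% example above (the same tabulation that caret automates on the next slide):

```r
# Count the four confusion-matrix cells, with 1 as the positive class
TP <- sum(df$ypred ==  1 & df$y ==  1)   # true positives
FP <- sum(df$ypred ==  1 & df$y == -1)   # false positives (Type I errors)
FN <- sum(df$ypred == -1 & df$y ==  1)   # false negatives (Type II errors)
TN <- sum(df$ypred == -1 & df$y == -1)   # true negatives
matrix(c(TP, FN, FP, TN), nrow = 2,
       dimnames = list(Predicted = c("1", "-1"), True = c("1", "-1")))
```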

## Confusion Matrix

• Unbalanced classes

```r
library(caret)
cm_orig <- confusionMatrix(df$ypred, df$y, positive = "1", mode = "prec_recall")
print(cm_orig$table)
##           Reference
## Prediction  -1   1
##         -1 950  50
##         1    0   0
```

• Upsampled data

```r
cm_up <- confusionMatrix(df$upsampred, df$y, positive = "1", mode = "prec_recall")
print(cm_up$table)
##           Reference
## Prediction  -1   1
##         -1 654  10
##         1  296  40
```

## Precision and Recall, F-measure

$\mathrm{Precision} = \frac{\sum \mathrm{True\ positive}}{\sum \mathrm{Predicted\ positive}} \qquad \mathrm{Recall} = \frac{\sum \mathrm{True\ positive}}{\sum \mathrm{Class\ positive}}$

• In Information Retrieval, there are typically very few positives and many negatives (e.g., a billion webpages, a dozen relevant to a search query). The focus is on correctly identifying positives.
• Recall: What proportion of the positives in the population do I correctly capture?
• Precision: What proportion of the instances I labeled positive are actually positive?
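
As a sketch, plugging in the cell counts from the up-sampled confusion matrix shown above (TP = 40, FP = 296, FN = 10):

```r
TP <- 40; FP <- 296; FN <- 10   # cell counts from the up-sampled confusion matrix
precision <- TP / (TP + FP)     # proportion of predicted positives that are truly positive
recall    <- TP / (TP + FN)     # proportion of actual positives that are captured
c(precision = precision, recall = recall)
```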

$\mbox{F-measure} = 2 \frac{\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}$ https://en.wikipedia.org/wiki/F1_score

## F-measure Example

$\mbox{F-measure} = 2 \frac{\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}$ https://en.wikipedia.org/wiki/F1_score

For the “always predict -1” classifier, recall = 0 and precision is 0/0 (undefined; by convention taken as 0), so
F-measure = 0.

For the classifier learned from up-sampled data,

```r
prec   <- sum(df$y == 1 & df$upsampred == 1) / sum(df$upsampred == 1)
recall <- sum(df$y == 1 & df$upsampred == 1) / sum(df$y == 1)
F1.upsamp <- 2 * prec * recall / (prec + recall)
print(F1.upsamp)
## [1] 0.2072539
```

Note that the F-measure is not “symmetric”: it depends on which class is defined as positive. It is typically used when the positive class is rare but important, e.g. in information retrieval.

## Sensitivity and Specificity, Balanced Accuracy

$\mathrm{Sensitivity} = \frac{\sum \mathrm{True\ positive}}{\sum \mathrm{Class\ positive}} \qquad \mathrm{Specificity} = \frac{\sum \mathrm{True\ negative}}{\sum \mathrm{Class\ negative}}$

• Sensitivity: What proportion of the positives in the population do I correctly label?
• Specificity: What proportion of the negatives in the population do I correctly label?

$\mbox{BalancedAccuracy} = \frac{1}{2} (\mathrm{Sensitivity}+\mathrm{Specificity})$

https://en.wikipedia.org/wiki/Accuracy_and_precision#In_binary_classification
Note: Sensitivity is same as Recall.

## Balanced accuracy Example

• Sensitivity: What proportion of the positives in the population do I correctly label?
• Specificity: What proportion of the negatives in the population do I correctly label?

$\mbox{BalancedAccuracy} = \frac{1}{2} (\mathrm{Sensitivity}+\mathrm{Specificity})$

For the “always predict -1” classifier, sensitivity = 0, specificity = 1, so balanced accuracy = 0.5.

For the classifier learned from up-sampled data,

```r
sens <- sum(df$y == 1 & df$upsampred == 1) / sum(df$y == 1)
spec <- sum(df$y == -1 & df$upsampred == -1) / sum(df$y == -1)
bal.acc.upsamp <- 0.5 * (sens + spec)
print(bal.acc.upsamp)
## [1] 0.7442105
```
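
For reference, the caret `confusionMatrix` object computed earlier also reports these quantities; assuming a reasonably recent caret version, they can be read off `cm_up$byClass` and should agree with the manual calculation above:

```r
# Sensitivity, specificity, balanced accuracy (and F1) from the caret object
cm_up$byClass[c("Sensitivity", "Specificity", "Balanced Accuracy", "F1")]
```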

## Many Measures

https://en.wikipedia.org/wiki/Evaluation_of_binary_classifiers

| Measure                                        | Definition                                               |
|------------------------------------------------|----------------------------------------------------------|
| Prevalence                                     | Σ class positive / Σ total population                    |
| Accuracy (ACC)                                 | (Σ true positive + Σ true negative) / Σ total population |
| Positive predictive value (PPV), Precision     | Σ true positive / Σ predicted positive                   |
| False discovery rate (FDR)                     | Σ false positive / Σ predicted positive                  |
| False omission rate (FOR)                      | Σ false negative / Σ predicted negative                  |
| Negative predictive value (NPV)                | Σ true negative / Σ predicted negative                   |
| True positive rate (TPR), Sensitivity, Recall  | Σ true positive / Σ class positive                       |
| False negative rate (FNR), Miss rate           | Σ false negative / Σ class positive                      |
| False positive rate (FPR), Fall-out            | Σ false positive / Σ class negative                      |
| True negative rate (TNR), Specificity (SPC)    | Σ true negative / Σ class negative                       |
| Positive likelihood ratio (LR+)                | TPR / FPR                                                |
| Negative likelihood ratio (LR−)                | FNR / TNR                                                |
| Diagnostic odds ratio (DOR)                    | LR+ / LR−                                                |
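
As a sketch, the rate-type measures in this table can all be computed from the four cell counts; here using the counts from the up-sampled confusion matrix shown earlier:

```r
TP <- 40; FP <- 296; FN <- 10; TN <- 654   # cell counts from the up-sampled confusion matrix
TPR <- TP / (TP + FN)   # true positive rate = sensitivity = recall
FPR <- FP / (FP + TN)   # false positive rate = fall-out
FNR <- FN / (FN + TP)   # false negative rate = miss rate
TNR <- TN / (TN + FP)   # true negative rate = specificity
PPV <- TP / (TP + FP)   # positive predictive value = precision
NPV <- TN / (TN + FN)   # negative predictive value
LR.pos <- TPR / FPR     # positive likelihood ratio
LR.neg <- FNR / TNR     # negative likelihood ratio
DOR <- LR.pos / LR.neg  # diagnostic odds ratio
```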

## Cost sensitivity

$\mathrm{Sensitivity} = \frac{\sum \mathrm{True\ positive}}{\sum \mathrm{Class\ positive}} \qquad \mathrm{Specificity} = \frac{\sum \mathrm{True\ negative}}{\sum \mathrm{Class\ negative}}$

Recall the definition of balanced accuracy:

$\mbox{BalancedAccuracy} = \frac{1}{2} (\mathrm{Sensitivity}+\mathrm{Specificity})$

What if e.g. false positives are more costly than false negatives?

Let $$\mathrm{P}$$ and $$\mathrm{N}$$ be the proportions of positives and negatives in the population.

$\mathrm{FNRate} = (1 - \mathrm{Sensitivity}), \mathrm{FPRate} = (1 - \mathrm{Specificity})$

$\mbox{NormExpectedCost} = c_{\mathrm{FP}}\cdot\mathrm{FPRate}\cdot\mathrm{N}+ c_{\mathrm{FN}}\cdot\mathrm{FNRate}\cdot\mathrm{P}$

http://www.csi.uottawa.ca/~cdrummon/pubs/pakdd08.pdf
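
A minimal sketch of this calculation for the up-sampled classifier, using `sens` and `spec` from the balanced-accuracy slide; the unit costs `c.FP` and `c.FN` below are made-up values for illustration:

```r
c.FP <- 5; c.FN <- 1   # hypothetical costs: a false positive is 5x as costly as a false negative
P <- mean(df$y == 1)   # proportion of positives in the sample (0.05)
N <- mean(df$y == -1)  # proportion of negatives (0.95)
c.FP * (1 - spec) * N + c.FN * (1 - sens) * P   # expected cost: each error rate weighted by cost and class proportion
```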

• Suppose classifier can rank inputs according to “how positive” they appear to be.
• E.g., can use probability from Logistic Regression, or $$w^{\mathsf T}x + b$$ for SVM.
• By adjusting the “threshold” value for deciding that an instance is positive, we can trade off the two error types. A low threshold yields more false positives (but also more true positives); a high threshold yields fewer false positives (but more false negatives).
• ROC curve: Try all possible cutoffs, plot FPR on $$x$$-axis, TPR on $$y$$-axis.

Think: “If I fix FPR at 0.4, what is my TPR?”

Obviously, higher is better. Random guessing gives an ROC curve along $$y = x$$.

If the area under the curve (AUC) is 1, we have a perfect classifier. An AUC of 0.5 corresponds to random guessing; a curve below the diagonal (AUC < 0.5) is worse than chance, though flipping its predictions would make it better than chance.

Very common measure of classifier performance, especially when classes are imbalanced.
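
Before turning to the ROCR examples below, here is a hand-rolled sketch of the same idea: sweep a threshold over a vector of scores (standing in for decision values or predicted probabilities) and record the (FPR, TPR) pair at each cutoff:

```r
# Hand-rolled ROC points: one (FPR, TPR) pair per candidate threshold
roc_points <- function(scores, y) {
  thresholds <- sort(unique(scores), decreasing = TRUE)
  t(sapply(thresholds, function(th) {
    yhat <- ifelse(scores >= th, 1, -1)                  # predict positive above the cutoff
    c(FPR = sum(yhat == 1 & y == -1) / sum(y == -1),     # false positive rate
      TPR = sum(yhat == 1 & y ==  1) / sum(y ==  1))     # true positive rate
  }))
}
# e.g. roc <- roc_points(scores, df$y); plot(roc[, "FPR"], roc[, "TPR"], type = "l")
```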

## ROC example - Original: AUC = 0.475

```r
library(ROCR)
preds <- attr(predict(rsep$best.model, df, decision.values = TRUE), "decision.values")
plot(performance(prediction(preds, df$y), "tpr", "fpr"))
abline(a = 0, b = 1)
performance(prediction(preds, df$y), "auc")@y.values
## [[1]]
## [1] 0.4750316
```

## ROC example - Upsampled: AUC = 0.799

```r
library(ROCR)
preds <- attr(predict(upsep$best.model, df, decision.values = TRUE), "decision.values")
plot(performance(prediction(preds, df$y), "tpr", "fpr"))
abline(a = 0, b = 1)
```