2016-04-06

Performance Measures for Classification

Why different performance measures?


To date, we have focussed on accuracy: How often is my classifier correct on new data?

Depending on how the classifier will be applied, however, other measures may be more appropriate.

Review: Error Rate / Accuracy

Compute the proportion that were correctly or incorrectly classified.

\[\mathrm{Accuracy} = n^{-1} \sum_{i=1}^n 1(\hat{y}_i = y_i)\]

\[\mathrm{Error Rate} = n^{-1} \sum_{i=1}^n 1(\hat{y}_i \ne y_i)\]
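
As a quick sketch (with made-up label vectors, not data from these notes), both are one-liners in R:

y    <- c(1, 1, -1, -1, 1, -1)   # true labels (made up)
yhat <- c(1, -1, -1, -1, 1, 1)   # predicted labels (made up)
mean(yhat == y)                  # accuracy
## [1] 0.6666667
mean(yhat != y)                  # error rate
## [1] 0.3333333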

Imbalanced classes

Example: 50% Positive, 50% Negative

library(e1071)   # svm(), tune()
library(ggplot2) # plotting
npos <- 500; nneg <- 500; set.seed(1)
mupos <- 2; muneg <- -2  # class means; illustrative values (not specified in these notes)
df <- rbind(data.frame(x=rnorm(npos,mupos), y=1),data.frame(x=rnorm(nneg,muneg),y=-1)); df$y <- as.factor(df$y)
sep <- tune(svm,y~x,data=df,ranges=list(gamma = 2^(-1:1), cost = 2^(2:4)))
df$ypred <- predict(sep$best.model)
ggplot(df,aes(x=x,fill=y)) + geom_histogram(alpha=0.2,position="identity",bins=51) + geom_point(aes(y=ypred,colour=ypred)) + scale_color_discrete(drop=FALSE)

Example: 5% Positive, 95% Negative

npos <- 50; nneg <- 950; set.seed(1)
df <- rbind(data.frame(x=rnorm(npos,mupos), y=1),data.frame(x=rnorm(nneg,muneg),y=-1)); df$y <- as.factor(df$y)
sep <- tune(svm,y~x,data=df,ranges=list(gamma = 2^(-1:1), cost = 2^(2:4)))
df$ypred <- predict(sep$best.model);
ggplot(df,aes(x=x,fill=y)) + geom_histogram(alpha=0.2,position="identity",bins=51) + geom_point(aes(y=ypred,colour=ypred)) + scale_color_discrete(drop=FALSE)

Example: Upsampling

library(dplyr)  # filter(), sample_n(), %>%
# Up-sample the rare positive class: draw 900 extra positives with replacement
newpos <- df %>% filter(y == 1) %>% sample_n(900,replace=T); dfupsamp <- rbind(df,newpos)
sep <- tune(svm,y~x,data=dfupsamp,ranges=list(gamma = 2^(-1:1), cost = 2^(2:4)))
dfupsamp$ypred <- predict(sep$best.model)
ggplot(dfupsamp,aes(x=x,fill=y)) + geom_histogram(alpha=0.2,position="identity",bins=51) + geom_point(aes(y=ypred,colour=ypred)) + scale_color_discrete(drop=FALSE)

Upsampling: Accuracy

df$upsampred <- predict(sep$best.model,df)
mean(df$y == df$ypred)
## [1] 0.95
mean(df$y == df$upsampred)
## [1] 0.684

No upsampling: 95% Accuracy

Upsampled: about 68% Accuracy

So why might we still prefer the upsampled classifier?

Definitions:
True/False Positives/Negatives

Precision and Recall, F-measure

\[\mathrm{Precision} = \frac{\sum \mathrm{True\ positive}}{\sum \mathrm{Predicted\ positive}}\]

\[\mathrm{Recall} = \frac{\sum \mathrm{True\ positive}}{\sum \mathrm{Class\ positive}}\]


  • In Information Retrieval, typically very few positives, many negatives. (E.g. billion webpages, dozen relevant to search query.) Focus is on correctly identifying positives.
  • Recall: What proportion of the positives in the population do I correctly capture?
  • Precision: What proportion of the instances I labeled positive are actually positive?

\[\mbox{F-measure} = 2 \frac{\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}\] https://en.wikipedia.org/wiki/F1_score

F-measure Example

\[\mbox{F-measure} = 2 \frac{\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}\] https://en.wikipedia.org/wiki/F1_score

For the "always predict -1" classifier, recall = 0, precision = 0, so
F-measure = 0.

For the classifier learned from up-sampled data,

prec <- sum(df$y == 1 & df$upsampred == 1) / sum(df$upsampred == 1)
recall <- sum(df$y == 1 & df$upsampred == 1) / sum(df$y == 1)
F1.upsamp <- 2 * prec*recall / (prec + recall)
print(F1.upsamp)
## [1] 0.2020202

NOTE that F-measure is not "symmetric"; it depends on the definition of the positive class. Typically used when positive class is rare but important to an application.
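
As a rough illustration of that asymmetry (reusing df$y and df$upsampred from above, but now treating -1 as the "positive" class instead):

prec.neg   <- sum(df$y == -1 & df$upsampred == -1) / sum(df$upsampred == -1)
recall.neg <- sum(df$y == -1 & df$upsampred == -1) / sum(df$y == -1)
F1.neg     <- 2 * prec.neg * recall.neg / (prec.neg + recall.neg)

The predictions are identical, but on these simulated data F1.neg comes out much higher (around 0.8 versus 0.20 above), because the abundant -1 class is easy to label correctly.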

Sensitivity and Specificity,
Balanced Accuracy

  • Sensitivity: What proportion of the positives in the population do I correctly label?
  • Specificity: What proportion of the negatives in the population do I correctly label?

\[\mbox{BalancedAccuracy} = \frac{1}{2} (\mathrm{Sensitivity}+\mathrm{Specificity})\]

Balanced accuracy Example

For "always predict -1" classifier, sensitivity = 0, specificity = 1, balanced accuracy = 0.5.

For the classifier learned from up-sampled data,

sens <- sum(df$y == 1 & df$upsampred == 1) / sum(df$y == 1)
spec <- sum(df$y == -1 & df$upsampred == -1) / sum(df$y == -1)
bal.acc.upsamp <- 0.5*(sens + spec)
print(bal.acc.upsamp)
## [1] 0.7389474

Many Measures

https://en.wikipedia.org/wiki/Evaluation_of_binary_classifiers

The 2×2 table of True class vs. Predicted class has cells True positive, False positive (Type I error), False negative (Type II error), and True negative. From these:

  • Prevalence = Σ Class positive / Σ Total population
  • Accuracy (ACC) = (Σ True positive + Σ True negative) / Σ Total population
  • Positive predictive value (PPV), Precision = Σ True positive / Σ Predicted positive
  • False discovery rate (FDR) = Σ False positive / Σ Predicted positive
  • False omission rate (FOR) = Σ False negative / Σ Predicted negative
  • Negative predictive value (NPV) = Σ True negative / Σ Predicted negative
  • True positive rate (TPR), Sensitivity, Recall = Σ True positive / Σ Class positive
  • False positive rate (FPR), Fall-out = Σ False positive / Σ Class negative
  • False negative rate (FNR), Miss rate = Σ False negative / Σ Class positive
  • True negative rate (TNR), Specificity (SPC) = Σ True negative / Σ Class negative
  • Positive likelihood ratio (LR+) = TPR / FPR
  • Negative likelihood ratio (LR−) = FNR / TNR
  • Diagnostic odds ratio (DOR) = LR+ / LR−
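
As a small sketch, most of these can be read off a confusion matrix built with table(); here it reuses df$y and df$upsampred from the up-sampling example:

cm <- table(truth = df$y, pred = df$upsampred)  # 2x2 confusion matrix
TP <- cm["1","1"];  FP <- cm["-1","1"]
FN <- cm["1","-1"]; TN <- cm["-1","-1"]

TPR <- TP / (TP + FN)          # sensitivity / recall
TNR <- TN / (TN + FP)          # specificity
FPR <- FP / (FP + TN)          # fall-out
PPV <- TP / (TP + FP)          # precision
NPV <- TN / (TN + FN)
prevalence <- (TP + FN) / sum(cm)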

Cost sensitivity

\[\mathrm{Sensitivity} = \frac{\sum \mathrm{True\ positive}}{\sum \mathrm{Class\ positive}}\]

\[\mathrm{Specificity} = \frac{\sum \mathrm{True\ negative}}{\sum \mathrm{Class\ negative}}\]

Recall that:

\[\mbox{BalancedAccuracy} = \frac{1}{2} (\mathrm{Sensitivity}+\mathrm{Specificity})\]

What if e.g. false positives are more costly than false negatives?

Let \(\mathrm{P}\) and \(\mathrm{N}\) be the proportions of positives and negatives in the population.

\[\mathrm{FNRate} = (1 - \mathrm{Sensitivity}), \mathrm{FPRate} = (1 - \mathrm{Specificity})\]

\[\mbox{NormExpectedCost} = c_{\mathrm{FP}}\cdot\mathrm{FPRate}\cdot\mathrm{N} + c_{\mathrm{FN}}\cdot\mathrm{FNRate}\cdot\mathrm{P}\]

http://www.csi.uottawa.ca/~cdrummon/pubs/pakdd08.pdf
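
A minimal sketch of this cost calculation, reusing sens and spec from the balanced-accuracy example; the costs c.FP and c.FN below are made-up values for illustration:

c.FP <- 1; c.FN <- 5            # made-up costs: a false negative is 5x as costly
P <- mean(df$y == 1)            # proportion of positives
N <- mean(df$y == -1)           # proportion of negatives
FPrate <- 1 - spec
FNrate <- 1 - sens
norm.expected.cost <- c.FP * FPrate * N + c.FN * FNrate * P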

Receiver operating characteristic (ROC)


  • Suppose the classifier can rank inputs according to "how positive" they appear to be.
  • E.g., we can use the probability from Logistic Regression, or \(w^{\mathsf T}x + b\) for an SVM.
  • By adjusting the "threshold" value for deciding an instance is positive, we can obtain different false positive rates. A low threshold gives more false positives (but also more true positives); a high threshold gives fewer false positives (but more false negatives).
  • ROC curve: Try all possible cutoffs, plot FPR on the \(x\)-axis and TPR on the \(y\)-axis (a small sketch follows after the link below).

https://en.wikipedia.org/wiki/Receiver_operating_characteristic
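
A rough sketch of building an ROC curve by hand: it scores points with a logistic regression on df (one of the rankers mentioned above) and sweeps a threshold over the scores. Packages such as ROCR or pROC automate this.

fit   <- glm(I(y == 1) ~ x, data = df, family = binomial)
score <- predict(fit, type = "response")        # "how positive" each point looks

thresholds <- sort(unique(score), decreasing = TRUE)
roc <- t(sapply(thresholds, function(th) {
  pred <- score >= th
  c(FPR = sum(pred & df$y == -1) / sum(df$y == -1),
    TPR = sum(pred & df$y ==  1) / sum(df$y ==  1))
}))
roc <- rbind(c(FPR = 0, TPR = 0), roc)          # start the curve at (0,0)

ggplot(as.data.frame(roc), aes(x = FPR, y = TPR)) + geom_step() +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed")

# Area under the curve by the trapezoidal rule
auc <- sum(diff(roc[,"FPR"]) * (head(roc[,"TPR"], -1) + tail(roc[,"TPR"], -1)) / 2)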

Reading an ROC Curve

Think: "If I fix FPR at 0.4, what is my TPR?"

Obviously, higher is better. Random guessing gives an ROC curve along \(y = x\).

If the area under the curve (AUC) is 1, we have a perfect classifier. An AUC of 0.5 is no better than random guessing.

Very common measure of classifier performance, especially when classes are imbalanced.

Big picture: Optimizing classifiers

If we care about all these measures, why do we optimize misclassification rate, or margin, or likelihood?

  • Computational tractability
  • Classifiers learned the way we described often perform well on the measures presented here

  • However
    • There are methods for learning e.g. SVMs by optimizing ROC
    • Cost-sensitive learning is also widespread
    • Methods are evolving; a quick Google Scholar search is a good idea.

Performance Measures for Regression

Mean Errors

\[ \mathrm{MSE} = n^{-1} \sum_{i=1}^n (\hat y_i - y_i)^2\]

\[ \mathrm{RMSE} = \sqrt{ n^{-1} \sum_{i=1}^n (\hat y_i - y_i)^2 }\]

\[ \mathrm{MAE} = n^{-1} \sum_{i=1}^n |\hat y_i - y_i|\]

I find MAE easier to interpret. (How far am I from the correct value, on average?) RMSE is at least in the same units as the \(y\).
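
A minimal sketch computing all three, on a made-up regression fit (the data and model here are placeholders, not from these notes):

set.seed(1)
x <- runif(100, 0, 10)
y <- 2 * x + rnorm(100)          # made-up data with additive noise
yhat <- predict(lm(y ~ x))       # fitted values from a linear regression

MSE  <- mean((yhat - y)^2)
RMSE <- sqrt(MSE)
MAE  <- mean(abs(yhat - y))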

Mean Relative Error

\[ \mathrm{MRE} = n^{-1} \sum_{i=1}^n \frac{|\hat y_i - y_i|}{|y_i|}\]

Scales error according to magnitude of true \(y\). E.g., if MRE=\(0.2\), then regression is wrong by 20% of the value of \(y\), on average.

If this is appropriate for your problem then linear regression, which assumes additive error, may not be appropriate. Options include using a different model or regression on \(\log y\) rather than on \(y\).

https://en.wikipedia.org/wiki/Approximation_error#Formal_Definition
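
Continuing the sketch above: MRE for the same fit, plus a made-up example of the log-transform option, where the error is multiplicative and \(y\) is positive:

MRE <- mean(abs(yhat - y) / abs(y))   # beware of y values near zero

# Regressing on log(y) when the error is relative/multiplicative (requires y > 0)
y2    <- exp(1 + 0.2 * x + rnorm(100, sd = 0.1))   # made-up multiplicative-noise data
fit2  <- lm(log(y2) ~ x)
yhat2 <- exp(predict(fit2))                        # back-transform to the original scale
MRE2  <- mean(abs(yhat2 - y2) / abs(y2))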