Assessing the performance of a classifier

Suppose that we have an algorithm whose purpose is to classify unknown data into some number of classes. How do we assess the performance of this algorithm? Let's look at some of the possibilities. To make things concrete, we'll first fit a multinomial logistic regression classifier to (part of) the MNIST dataset. Let's start with the binary case, and look at 5's and 8's, which are easily confused.

In [72]:
from bokeh.plotting import figure
from bokeh.io import output_notebook, show
from bokeh.palettes import Spectral8
from bokeh.transform import linear_cmap
from bokeh.models import ColumnDataSource, ColorBar
output_notebook()
In [77]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, roc_curve, roc_auc_score
mnist = pd.read_csv('../data/MNIST/train.csv')
# keep only the images labelled 5 or 8
fives_eights = mnist[(mnist['label']==5) | (mnist['label']==8)]
labels = fives_eights['label'].values
images = fives_eights.iloc[:,1:].values
images = images/255
im_train, im_test, lab_train, lab_test = train_test_split(images,labels, test_size=.8,random_state=11)
In [19]:
model = LogisticRegression(solver='lbfgs',multi_class='multinomial',max_iter=1000).fit(im_train, lab_train)

The Logistic Model allows us to predict the class of a new image. The most basic tool to assess the classification is the confusion matrix.

In [20]:
predicted = model.predict(im_test[:1000])
true = lab_test[:1000]

The accuracy of this test is the number of correct results out of the total number of results. In this case that is 94.4%.

In [22]:
confusion_matrix(true, predicted)
array([[486,  24],
       [ 32, 458]])

The entries of this matrix are:

          Predicted=5   Predicted=8
True=5        486            24
True=8         32           458
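
The accuracy quoted above can be recovered from these four counts, since the correct predictions sit on the diagonal. A minimal sketch, with the matrix typed in by hand rather than recomputed from the model:

```python
# Confusion matrix from the text: rows are true labels (5, 8),
# columns are predicted labels (5, 8).
cm = [[486, 24],
      [32, 458]]

correct = cm[0][0] + cm[1][1]          # diagonal entries: correct predictions
total = sum(sum(row) for row in cm)    # all test images that were scored
accuracy = correct / total
print(f"accuracy = {accuracy:.3f}")
```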

Precision, Recall, and all that

Let's suppose for the moment that we consider an "8" a positive result and "5" a negative result. Therefore:

  • if our image is an 8 and our classifier says it's an 8, we have a true positive -- there are 458 true positives in this case.
  • if our image is an 8 and our classifier says it's a 5, we have a false negative -- there are 32 false negatives.
  • if our image is a 5 and our classifier says it's an 8, we have a false positive -- there are 24 false positives.
  • if our image is a 5 and our classifier says it's a 5, we have a true negative -- there are 486 true negatives.
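
The four counts can be read straight off the confusion matrix once a positive class is chosen. A small sketch, assuming the layout used earlier (rows are true labels 5 then 8, columns are predicted labels 5 then 8) and with the numbers typed in by hand:

```python
# Rows: true label 5 (negative), true label 8 (positive).
# Columns: predicted 5 (negative), predicted 8 (positive).
cm = [[486, 24],
      [32, 458]]

# With the negative class listed first, the rows unpack as (tn, fp) and (fn, tp).
(tn, fp), (fn, tp) = cm
print(f"TP={tp}, FN={fn}, FP={fp}, TN={tn}")
```

For a binary problem, scikit-learn's `confusion_matrix(true, predicted).ravel()` should yield the counts in the same order (tn, fp, fn, tp), because the labels are sorted and 5 comes before 8.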

The rates of false positives and false negatives are a refined perspective on the types of errors we get from our test/classifier. For example, suppose that instead of classifying digits we were giving someone a test for cancer. In that case, the implications of

  • a false positive (telling someone that they do have cancer when they don't)
  • a false negative (telling them they don't have cancer when they do)

are very different.

The True Positive Rate is the proportion of actual positives that the classifier correctly identifies. In the example above, it's the number of images correctly predicted to be an 8 among all the images that are actually an 8, or 458/(458+32) = 93.5%. This is also called sensitivity (especially in medical terminology) or recall.

A medical test for a condition has high sensitivity if, assuming you have the condition, the test is likely to detect it.

The True Negative Rate is the proportion of actual negatives that the classifier correctly identifies; in our case it's the number of images correctly predicted to be a 5 divided by the number that are actually a 5, or 486/(486+24) = 95.3%. This is also called specificity, especially in medical terminology.

A medical test with high specificity means that if you don't have the condition, the test is likely to give a negative result.

The Precision is the ratio of the true positives to the sum of the True and False Positives. In our case that is 458/(458+24) = 95%.

A medical test with high precision means that if the test is positive, you are highly likely to have the condition.
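
The three definitions differ only in which counts go in the denominator. The following sketch collects them in one helper; the function name `rates` is made up for illustration, and the counts are the ones from the 5-versus-8 example with 8 as the positive class:

```python
def rates(tp, fn, fp, tn):
    """Return sensitivity, specificity and precision from the four counts."""
    return {
        "sensitivity": tp / (tp + fn),   # true positives among actual positives
        "specificity": tn / (tn + fp),   # true negatives among actual negatives
        "precision": tp / (tp + fp),     # true positives among predicted positives
    }

digit_rates = rates(tp=458, fn=32, fp=24, tn=486)
for name, value in digit_rates.items():
    print(f"{name}: {value:.3f}")
```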

Thought Experiment. Imagine you have a medical test for colon cancer which is accurate 95% of the time. In other words, among the population as a whole, it correctly says yes, you have cancer when you do, or no, you don't have cancer, when you don't. And assume that in the population at large only 1% of the people actually have colon cancer.

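Assume also that the accuracy of the test does not depend on whether you have cancer: it is right 95% of the time for people who have the disease and 95% of the time for people who don't. So in a population of one million people: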

  • 10000 people have colon cancer
  • 9500 people with colon cancer get a positive test result
  • 500 people with colon cancer get a negative test result
  • 990000 do not have colon cancer
  • 940500 people without colon cancer get a negative test result
  • 49500 people without colon cancer get a positive test result

Here is the confusion matrix:

                 Has Colon Cancer   Does not Have
positive test          9500             49500
negative test           500            940500
In [31]:
sensitivity = 9500/(9500+500)          # true positives / everyone with cancer
specificity = 940500/(940500+49500)    # true negatives / everyone without cancer
precision = 9500/(9500+49500)          # true positives / all positive tests
In [38]:
print('sensitivity (recall) = {:.3f}, specificity = {:.3f}, precision = {:.3f}'.format(sensitivity,specificity,precision))
sensitivity (recall) = 0.950, specificity = 0.950, precision = 0.161

In other words, if you get a positive test for colon cancer, you only have about a 16% chance of actually having the disease.
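
The low precision comes from the rarity of the disease, not from the quality of the test. As a rough illustration, the sketch below recomputes the precision for a few prevalences; the function name and the 10% and 50% prevalences are assumptions added here, while the 95% figures and the 1% prevalence come from the thought experiment:

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Precision of a test applied to a population with the given prevalence."""
    true_pos = sensitivity * prevalence                # sick and flagged
    false_pos = (1 - specificity) * (1 - prevalence)   # healthy but flagged
    return true_pos / (true_pos + false_pos)

for prevalence in (0.01, 0.10, 0.50):
    ppv = positive_predictive_value(0.95, 0.95, prevalence)
    print(f"prevalence {prevalence:.0%}: precision {ppv:.3f}")
```

With the same test, the precision rises from about 16% at 1% prevalence to 95% when half the population has the condition.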

Here is another test for colon cancer. When a patient comes in, we pick a random number between 1 and 1 million. If we get a one, we say "you have colon cancer". Otherwise, we say you don't.

What is the accuracy of this test? Well, among the 990000 people who don't have cancer, we would expect to get maybe one person with a positive test. And among the 10000 who do, we expect no positive results. So this test is basically 99% accurate. Here is the confusion matrix:

                 Has Colon Cancer   Does not Have
positive test             0                 1
negative test         10000            989999
In [36]:
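sensitivity = 0/(0+10000)              # true positives / everyone with cancer
specificity = 989999/(989999+1)        # true negatives / everyone without cancer
precision = 0/(0+1)                    # true positives / all positive tests
In [37]:
print('sensitivity (recall) = {:.6f}, specificity = {:.6f}, precision = {:.6f}'.format(sensitivity,specificity,precision))
sensitivity (recall) = 0.000000, specificity = 0.999999, precision = 0.000000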

The main lesson of this thought experiment is that classification of rare events requires care!
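
To see how misleading accuracy alone can be, here is a small check using the counts from the random-number test's confusion matrix, typed in by hand:

```python
# Counts for the random-number "test" in a population of one million.
tp, fp = 0, 1               # positive test: with cancer, without cancer
fn, tn = 10_000, 989_999    # negative test: with cancer, without cancer

accuracy = (tp + tn) / (tp + fp + fn + tn)
recall = tp / (tp + fn)
print(f"accuracy = {accuracy:.4f}, recall = {recall:.4f}")
```

The accuracy is about 99% even though the test finds none of the people who actually have the disease.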

Why precision and recall?

Imagine a document retrieval system where you do a search on some query terms and get back a bunch of documents. There are two ways to measure how effective this is:

  • are you actually interested in the documents it finds, or are there a lot of irrelevant documents in the mix?
  • did you actually find everything you are interested in, or did you miss a lot of interesting things?

Obviously these two are in conflict, because if you retrieve more stuff to make sure you find everything interesting, you will get more irrelevant stuff along with it.

Precision measures whether the documents you retrieve (the "positive" results) are actually things you are interested in. It's the proportion of "interesting" or "true positive" documents among all of the retrieved documents (all of the positives, true and false). High precision means most of what you retrieve is interesting.

Recall measures whether you really retrieved everything interesting; it's the proportion of the things you are interested in that you actually retrieved. High recall means you retrieved most of the interesting things.
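
In the retrieval setting both quantities reduce to set arithmetic. A toy sketch, where the document names and the two sets are invented purely for illustration:

```python
# Hypothetical search: which documents matter, and which came back.
relevant = {"doc1", "doc2", "doc3", "doc4"}    # everything you are interested in
retrieved = {"doc2", "doc3", "doc7"}           # what the search returned

hits = relevant & retrieved                    # the "true positives"
precision = len(hits) / len(retrieved)         # how much of the result is useful
recall = len(hits) / len(relevant)             # how much of the useful material was found
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

Retrieving every document in the collection would push recall to 1 while dragging precision down, which is the conflict described above.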

Thresholds and ROC curves

Many classifiers actually compute some kind of score or probability, and the actual classification is done by setting a threshold and saying everything to one side of the threshold is a "positive result" and everything on the other side is a "negative result."

In the case of the logistic regression above, we actually computed probabilities and then said that if the probability was greater than or equal to .5 it was a positive result (an 8) and if it was less, it was a negative result (a 5).
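
The thresholding step itself is simple. A minimal sketch, assuming we already have the predicted probability that each image is an 8; the probabilities below are made up, and `classify` is a hypothetical helper rather than part of scikit-learn:

```python
def classify(prob_8, threshold=0.5):
    """Label an image 8 (positive) when its probability reaches the threshold."""
    return 8 if prob_8 >= threshold else 5

probs = [0.10, 0.45, 0.50, 0.80]    # hypothetical P(image is an 8)
default_labels = [classify(p) for p in probs]
lenient_labels = [classify(p, threshold=0.4) for p in probs]
print(default_labels)
print(lenient_labels)
```

Lowering the threshold turns the 0.45 image into a positive: more 8's are caught, at the cost of more 5's being mislabelled.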

We can study the effect of this threshold. One tool for doing this is called the Receiver Operating Characteristic Curve (ROC curve). The terminology goes back to the early days of radar and I would like to learn more about it if anyone can find anything out.

The ROC curve plots the True Positive rate against the False Positive rate as the threshold varies.
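
To make the construction concrete, the sketch below sweeps the threshold by hand over a tiny invented data set. The helper `roc_points` is hypothetical and written in plain Python; it is meant to illustrate the idea, not to reproduce scikit-learn's `roc_curve` in every detail:

```python
def roc_points(y_true, scores, positive):
    """Return (fpr, tpr) pairs, one per distinct score used as a threshold."""
    n_pos = sum(1 for y in y_true if y == positive)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]    # threshold above every score: nothing is positive
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == positive)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y != positive)
        points.append((fp / n_neg, tp / n_pos))
    return points

y_true = [5, 5, 8, 8]                # invented labels
scores = [0.1, 0.4, 0.35, 0.8]       # invented P(image is an 8)
points = roc_points(y_true, scores, positive=8)
for fpr_value, tpr_value in points:
    print(f"FPR = {fpr_value:.2f}, TPR = {tpr_value:.2f}")
```

As the threshold drops, both rates can only go up, so the curve runs from (0, 0) to (1, 1).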

In [81]:
scores = model.predict_proba(im_test[:1000])
In [82]:
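# roc_curve returns (fpr, tpr, thresholds); column 1 of scores is P(label == 8), the positive class
fpr, tpr, thresholds = roc_curve(true, scores[:,1],pos_label=8)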
In [83]:
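source = ColumnDataSource(pd.DataFrame({'tpr':tpr,'fpr':fpr,'thresholds':thresholds}))
roc = figure(title="ROC Curve",tooltips=[("threshold","@thresholds")],x_axis_label='FPR',y_axis_label='TPR')
mapper = linear_cmap(field_name='thresholds',palette=Spectral8,low=0,high=1)
roc.scatter('fpr','tpr',source=source,color=mapper)  # one point per threshold, colored by threshold
color_bar = ColorBar(color_mapper=mapper['transform'],location=(0,0))
roc.add_layout(color_bar,'right')
show(roc)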

Area under the ROC curve

One final measure of the accuracy of a (threshold-based) classifier is the area under the ROC curve. The larger this is, the better the performance of the classifier regardless of the threshold.
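
As a rough guide, an area of 1 corresponds to a classifier that separates the two classes perfectly, and an area of 0.5 to the diagonal line produced by random guessing. The sketch below computes the area with the trapezoid rule from a handful of (FPR, TPR) points; the points are invented for illustration and `auc_trapezoid` is a hypothetical helper. In the notebook itself, `roc_auc_score` from scikit-learn would be the natural tool.

```python
def auc_trapezoid(points):
    """Area under a piecewise-linear curve through (fpr, tpr) points sorted by fpr."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2    # trapezoid between neighbouring points
    return area

toy_roc = [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
diagonal = [(0.0, 0.0), (1.0, 1.0)]          # random guessing
toy_area = auc_trapezoid(toy_roc)
diagonal_area = auc_trapezoid(diagonal)
print(f"toy curve: {toy_area:.2f}, diagonal: {diagonal_area:.2f}")
```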

In [80]: