In our discussion of Bayes Theorem, we looked at a situation in which we had a population consisting of people infected with COVID-19 and people not infected, and we had a test that we could apply to determine which class an individual belonged to. Because our test was not 100 percent reliable, a positive test result didn’t guarantee that a person was infected, and we used Bayes Theorem to evaluate how to interpret the positive test result. More specifically, our information about the test performance gave us the conditional probabilities of positive and negative test results given infection status – so, for example, we were given $P(\text{positive test}\mid\text{infected})$ and $P(\text{negative test}\mid\text{not infected})$.
The Naive Bayes classification method is a generalization of this idea. We have data that belongs to one of two classes, and based on the results of a series of tests, we wish to decide which class a particular data point belongs to. For one example, we are given a collection of product reviews from a website and we wish to classify those reviews as either “positive” or “negative.” This type of problem is called “sentiment analysis.” For another, related example, we have a collection of emails or text messages and we wish to label those that are likely “spam” emails. In both of these examples, the “test” that we will apply is to look for the appearance or absence of certain key words that make the text more or less likely to belong to a certain class. For example, we might find that a movie review that contains the word “great” is more likely to be positive than negative, while a review that contains the word “boring” is more likely to be negative.
The reason for the word “naive” in the name of this method is that we will derive our probabilities from empirical data, rather than from any deeper theory. For example, to find the probability that a negative movie review contains the word “boring,” we will look at a bunch of reviews that our experts have said are negative, and compute the proportion of those that contain the word boring. Indeed, to develop our family of tests, we will rely on a training set of already classified data from which we can determine estimates of probabilities that we need.
To illustrate the Naive Bayes algorithm, we will work with the “Sentiment Labelled Sentences Data Set” ([1]). This dataset contains 3 files – drawn from amazon.com, imdb.com, and yelp.com – each containing 1000 documents labelled 1 (positive sentiment) or 0 (negative sentiment).
Let’s focus on the Amazon reviews data. Here are some samples:
```
So there is no way for me to plug it in here in the US unless I go by a converter. 0
Good case, Excellent value. 1
Great for the jawbone. 1
Tied to charger for conversations lasting more than 45 minutes.MAJOR PROBLEMS!! 0
The mic is great. 1
I have to jiggle the plug to get it to line up right to get decent volume. 0
If you have several dozen or several hundred contacts, then imagine the fun of sending each of them one by one. 0
If you are Razr owner...you must have this! 1
Needless to say, I wasted my money. 0
What a waste of money and time!. 0
```
As you can see, each line consists of a product review followed by a label: 1 if the review is positive, 0 if it is negative.
We will describe the “Bernoulli” version of a Naive Bayes classifier for this data. The building block of this method is a test based on a single word. For example, let’s consider the word great among all of our Amazon reviews. It turns out that great occurs in 97 of the 1000 reviews, with counts broken down by class in Table 1.
|        | +   | -   | total |
|--------|-----|-----|-------|
| great  | 92  | 5   | 97    |
| ~great | 408 | 495 | 903   |
| total  | 500 | 500 | 1000  |

*Table 1: counts of reviews containing the word great, by class.*
In this data, positive and negative reviews are equally likely, so

$$P(+) = P(-) = \frac{1}{2},$$

and the table gives us the empirical conditional probabilities

$$P(\textit{great}\mid +) = \frac{92}{500}, \qquad P(\textit{great}\mid -) = \frac{5}{500}.$$

Therefore, by Bayes Theorem,

$$P(+\mid \textit{great}) = \frac{P(\textit{great}\mid +)P(+)}{P(\textit{great}\mid +)P(+) + P(\textit{great}\mid -)P(-)} = \frac{92}{97} \approx 0.95.$$
In other words, if you see the word great in a review, there’s a 95% chance that the review is positive.
What if you do not see the word great? A similar calculation from the table yields

$$P(-\mid \sim\!\textit{great}) = \frac{495}{903} \approx 0.55.$$
In other words, not seeing the word great gives a little evidence that the review is negative (there’s a 55% chance it’s negative) but it’s not that conclusive.
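These two conditional probabilities can be checked directly from the counts in Table 1. A quick sketch in Python (the variable names are mine):

```python
# Counts from Table 1: reviews containing "great", split by class.
great_pos, great_neg = 92, 5            # reviews with "great"
no_great_pos, no_great_neg = 408, 495   # reviews without "great"

# P(+ | great): among the 97 reviews containing "great", 92 are positive.
p_pos_given_great = great_pos / (great_pos + great_neg)

# P(- | ~great): among the 903 reviews without "great", 495 are negative.
p_neg_given_no_great = no_great_neg / (no_great_pos + no_great_neg)

print(round(p_pos_given_great, 3))     # 0.948
print(round(p_neg_given_no_great, 3))  # 0.548
```

Because the two classes are balanced, conditioning on the class sizes here reduces to simple row proportions in the table.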
The word waste is associated with negative reviews. Its statistics are summarized in Table 2.
|        | +   | -   | total |
|--------|-----|-----|-------|
| waste  | 0   | 14  | 14    |
| ~waste | 500 | 486 | 986   |
| total  | 500 | 500 | 1000  |

*Table 2: counts of reviews containing the word waste, by class.*
Based on this data, the “naive” probabilities we are interested in are:

$$P(-\mid \textit{waste}) = \frac{14}{14} = 1, \qquad P(+\mid \sim\!\textit{waste}) = \frac{500}{986} \approx 0.51.$$
In other words, if you see waste you definitely have a negative review, but if you don’t, you’re only slightly more likely to have a positive one.
What about combining these two tests? Or using even more words? We could analyze our data to count cases in which both great and waste occur, in which only one occurs, or in which neither occurs, within the two different categories of reviews, and then use those counts to estimate empirical probabilities of the joint events. But while this might be feasible with two words, if we want to use many words, the number of combinations quickly becomes huge. So instead, we make a basic, and probably false, assumption, but one that makes a simple analysis possible.
Assumption: We assume that the presence or absence of the words great and waste in a particular review (positive or negative) are independent events. More generally, given a collection of words $w_1,\ldots,w_k$, we assume that their appearances in a review are mutually independent events, conditional on the class of the review.
Independence means that we have

$$P(w_1,\ldots,w_k \mid \pm) = P(w_1\mid\pm)\,P(w_2\mid\pm)\cdots P(w_k\mid\pm),$$

and similarly when some of the words are replaced by their absences.
So for example, if a document contains the word great and does not contain the word waste, then the probability of it being a positive review is:

$$P(+\mid \textit{great},\sim\!\textit{waste}) = \frac{P(\textit{great}\mid+)P(\sim\!\textit{waste}\mid+)P(+)}{P(\textit{great}\mid+)P(\sim\!\textit{waste}\mid+)P(+) + P(\textit{great}\mid-)P(\sim\!\textit{waste}\mid-)P(-)} = \frac{\frac{92}{500}\cdot\frac{500}{500}\cdot\frac{1}{2}}{\frac{92}{500}\cdot\frac{500}{500}\cdot\frac{1}{2} + \frac{5}{500}\cdot\frac{486}{500}\cdot\frac{1}{2}} \approx 0.95.$$
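This combined calculation can be carried out numerically from the counts in Tables 1 and 2; a minimal sketch (names are mine):

```python
# Class-conditional probabilities estimated from Tables 1 and 2.
p_great = {"+": 92 / 500, "-": 5 / 500}   # P(great | class)
p_waste = {"+": 0 / 500, "-": 14 / 500}   # P(waste | class)
prior = {"+": 0.5, "-": 0.5}              # the classes are balanced

def likelihood(c):
    """P(great, ~waste | class c) * P(c), assuming the two words
    appear independently within each class."""
    return p_great[c] * (1 - p_waste[c]) * prior[c]

# Bayes Theorem: normalize over the two classes.
p_pos = likelihood("+") / (likelihood("+") + likelihood("-"))
print(round(p_pos, 3))  # 0.95
```

Note that absence of waste carries almost no information for the positive class (since $P(\sim\!\textit{waste}\mid+) = 1$), so the answer is close to the single-word result for great.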
To generalize this, suppose that we have extracted keywords $w_1,\ldots,w_k$ from our training data, and for each keyword we have counted the number of positive reviews and the number of negative reviews in which it appears. Dividing these counts by the number of reviews in each class gives us the empirical probabilities $P(w_i\mid+)$ and $P(w_i\mid-)$.
Notice that we only count reviews, not occurrences, so that if a word occurs multiple times in a review it only contributes 1 to the count. That’s why this is called the Bernoulli Naive Bayes – we are thinking of each keyword as yielding a yes/no test on each review.
Given a review, we look to see whether each of our $k$ keywords appears in it, recording a 1 for each keyword that is present and a 0 for each that is absent. This gives us a feature vector $x = (x_1,\ldots,x_k)$ of zeros and ones.
For example, if our keywords are great, useless, and waste, and we are given the review

This phone is useless, useless, useless! What a waste!

then the associated feature vector is $x = (0, 1, 1)$.
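Extracting such a feature vector is a one-liner in practice. A minimal sketch (the keyword list and function name are illustrative, and the tokenization here is deliberately crude):

```python
def feature_vector(text, keywords):
    """Return a 0/1 vector: 1 if the keyword appears anywhere in the text.

    Repeated occurrences still contribute only a single 1 -- this is the
    Bernoulli (presence/absence) encoding.
    """
    # Crude tokenization: lowercase, strip basic punctuation, split on spaces.
    for ch in "!,.?":
        text = text.replace(ch, " ")
    words = set(text.lower().split())
    return [1 if w in words else 0 for w in keywords]

keywords = ["great", "useless", "waste"]  # illustrative keyword list
review = "This phone is useless, useless, useless! What a waste!"
print(feature_vector(review, keywords))  # [0, 1, 1]
```

Using a `set` of tokens makes the presence test $O(1)$ per keyword and automatically collapses repeats.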
For the purposes of classification of our reviews, we are going to forget entirely about the text of our reviews and work only with the feature vectors. From an abstract perspective, then, by choosing our $k$ keywords we have converted each review into a vector of length $k$ consisting of zeros and ones, so our training set becomes an $N\times k$ matrix $X$, where $N$ is the number of reviews.
The labels of the training reviews form a vector $y$ of length $N$, where $y_j = 1$ if review $j$ is positive and $y_j = 0$ if it is negative.
Setting things up this way lets us express the computations of our probabilities in terms of matrix arithmetic on $X$ and $y$.
Note that the number of positive reviews is $\sum_j y_j$, and the number of positive reviews containing keyword $w_i$ is $\sum_j X_{ji} y_j$, so the empirical probabilities are

$$P(w_i\mid +) = \frac{\sum_j X_{ji}\,y_j}{\sum_j y_j}, \qquad P(w_i\mid -) = \frac{\sum_j X_{ji}\,(1-y_j)}{\sum_j (1-y_j)}.$$
If a review has an associated feature vector $x = (x_1,\ldots,x_k)$, then the independence assumption gives

$$P(x\mid \pm) = \prod_{i=1}^{k} P(w_i\mid\pm)^{x_i}\,\bigl(1 - P(w_i\mid\pm)\bigr)^{1-x_i}.$$
These products aren’t practical to work with – they are often the product of many, many small numbers and are therefore really tiny – so it’s much more practical to work with their logarithms.
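The log-likelihood of a feature vector is then just a sum of per-keyword terms. A sketch (names are mine; the probabilities are assumed to lie strictly between 0 and 1, e.g. after the smoothing trick discussed below):

```python
import math

def log_likelihood(x, p):
    """log P(x | class) for a Bernoulli Naive Bayes model.

    x : list of 0/1 features for one review
    p : list of estimated P(w_i | class), each strictly in (0, 1)
    """
    return sum(
        math.log(p_i) if x_i == 1 else math.log(1 - p_i)
        for x_i, p_i in zip(x, p)
    )

# Probabilities for the keywords (great, waste) in the positive class,
# smoothed by one fake review so that nothing is exactly 0 or 1:
# (92+1)/(500+2) and (0+1)/(500+2).
p_pos = [93 / 502, 1 / 502]

# Score of a review containing "great" but not "waste".
print(round(log_likelihood([1, 0], p_pos), 3))
```

Summing logarithms avoids the underflow that multiplying hundreds of small probabilities would cause.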
If we have a group of reviews, we can assemble their feature vectors into the rows of a matrix and compute all of their log-probabilities at once using matrix multiplication.
By Bayes Theorem, we can express the chance that our review with feature vector $x$ is positive as

$$P(+\mid x) = \frac{P(x\mid+)P(+)}{P(x\mid+)P(+) + P(x\mid-)P(-)}.$$
A natural classification rule would be to say that a review is positive if $P(+\mid x) > P(-\mid x)$ – equivalently, taking logarithms, if

$$\log P(x\mid+) + \log P(+) > \log P(x\mid-) + \log P(-).$$
Again we can exploit the matrix structure to do this for a bunch of reviews at once. Using eq. 3 and the vectors of per-keyword log-probabilities for each class, the scores of all of the reviews can be computed with a single matrix multiplication.
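The matrix version can be sketched with NumPy (all names and the smoothed probability values below are illustrative):

```python
import numpy as np

def scores(X, p, prior):
    """Log-scores log P(x_j | class) + log P(class) for every row of X.

    X     : (N, k) array of 0/1 features, one review per row
    p     : length-k array of estimated P(w_i | class), each in (0, 1)
    prior : P(class)
    """
    # Present keywords contribute log p_i, absent ones log(1 - p_i).
    return X @ np.log(p) + (1 - X) @ np.log(1 - p) + np.log(prior)

# Three reviews over the two keywords (great, waste):
# contains great only; contains waste only; contains neither.
X = np.array([[1, 0], [0, 1], [0, 0]])

# Smoothed per-class probabilities from Tables 1 and 2
# (one fake review added to each class).
p_pos = np.array([93 / 502, 1 / 502])
p_neg = np.array([6 / 502, 15 / 502])

positive = scores(X, p_pos, 0.5) > scores(X, p_neg, 0.5)
print(positive)
```

Each comparison reproduces the earlier single-word analysis: great pushes a review positive, waste pushes it strongly negative, and the absence of both leans slightly negative.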
In our analysis above, we thought of the presence or absence of certain key words as a set of independent tests that provided evidence of whether our review was positive or negative. This approach is suited to short pieces of text, but what about longer documents? In that case, we might want to consider not just the presence or absence of words, but the frequency with which they appear. Multinomial Naive Bayes, based on the “bag of words” model, is a classification method similar to Bernoulli Naive Bayes but which takes term frequency into account.
Let’s consider, as above, the problem of classifying documents into one of two classes. We assume that we have a set of keywords $w_1,\ldots,w_k$ and, for each class, probabilities $p_i^{\pm}$ giving the chance that a word drawn at random from a document of that class is the word $w_i$.
The “bag of words” model says that we construct a document of length $N$ by drawing $N$ words independently at random, choosing word $w_i$ with probability $p_i$ each time. The probability of producing a document in which word $w_i$ occurs $x_i$ times is then given by the multinomial distribution:

$$P(x_1,\ldots,x_k) = \binom{N}{x_1,\ldots,x_k}\prod_{i=1}^{k} p_i^{x_i}.$$
In the Multinomial Naive Bayes method, we estimate the probabilities $p_i^{\pm}$ empirically: $p_i^{+}$ is the total number of occurrences of $w_i$ in documents of the positive class, divided by the total number of words in documents of that class, and similarly for $p_i^{-}$.
As in the Bernoulli case, we often add a fake document to each class where all of the words occur once, in order to avoid having zero frequencies when we take a logarithm later.
Now, given a document, we associate a feature vector $x = (x_1,\ldots,x_k)$, where $x_i$ is the number of times keyword $w_i$ occurs in the document.
From Bayes Theorem, we have

$$P(\pm\mid x) = \frac{P(x\mid\pm)P(\pm)}{P(x)}.$$
We classify our document by considering the ratio of $P(+\mid x)$ to $P(-\mid x)$ – or, better, the difference of their logarithms. In this comparison, both the constant (the multinomial coefficient) and the denominator cancel out, so we only need to compare

$$\log P(+) + \sum_{i=1}^{k} x_i \log p_i^{+} \quad\text{with}\quad \log P(-) + \sum_{i=1}^{k} x_i \log p_i^{-}.$$
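This comparison is a dot product of the count vector with the per-class log-probabilities. A minimal sketch (the counts and probabilities are illustrative, not taken from the review data):

```python
import math

def multinomial_score(counts, word_probs, prior):
    """log P(class) + sum_i x_i * log p_i.

    The multinomial coefficient is omitted: it depends only on the
    counts, not on the class, so it cancels in the comparison.
    """
    return math.log(prior) + sum(
        x * math.log(p) for x, p in zip(counts, word_probs)
    )

# Illustrative example: keyword counts in one document, and smoothed
# per-class word probabilities that each sum to 1 over the vocabulary.
counts = [2, 0, 1]
p_plus = [0.5, 0.3, 0.2]
p_minus = [0.2, 0.3, 0.5]

doc_is_positive = (
    multinomial_score(counts, p_plus, 0.5)
    > multinomial_score(counts, p_minus, 0.5)
)
print(doc_is_positive)
```

Because the first keyword dominates the counts and is much likelier under the positive class, the positive score wins here.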
Therefore, just as in the Bernoulli case, we can package up our document counts into a matrix, with one row per document, and compute the scores of all of our documents at once by a matrix multiplication against the vectors of log-probabilities $\log p_i^{\pm}$.
We developed the Naive Bayes method for sentiment analysis, but once we chose a set of keywords our training data was reduced to an $N\times k$ matrix of zeros and ones (or of counts) together with a vector of class labels. The method therefore applies to any data that can be put in this form.
For example, suppose we have a collection of images represented as black/white pixels in a grid that belong to one of two classes. For example, we might have