The scikit-learn library is a large open-source Python package that implements a wide range of machine learning algorithms. It has a consistent structure, and once you understand that basic structure you can use the documentation to learn how to carry out nearly any algorithm. In this overview we will look at this structure for linear regression and for the Naive Bayes model.
Typically, to apply the scikit-learn library to a machine learning problem, you:

1. construct an object that represents the algorithm you plan to apply (such as linear regression), setting whatever parameters you choose to control the details of the algorithm;
2. fit your training data using the fit method of your object, which computes the parameters of the algorithm from the data;
3. transform or predict from other data to compute the predicted values based on the fitted model;
4. score your results or evaluate them in some way.
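Before diving in, here is a minimal, self-contained sketch of this four-step pattern, using a toy one-feature dataset invented just for this illustration (any sklearn estimator follows the same outline):

import numpy as np
from sklearn.linear_model import LinearRegression
X_toy = np.array([[0.], [1.], [2.], [3.]])  # 4 samples, 1 feature
y_toy = np.array([1., 3., 5., 7.])          # exactly y = 2x + 1
model = LinearRegression()     # 1. construct the estimator
model.fit(X_toy, y_toy)        # 2. fit it to the training data
model.predict([[4.]])          # 3. predict on new data -> array([9.])
model.score(X_toy, y_toy)      # 4. evaluate the fit -> 1.0 (a perfect fit)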
Let's look at how sklearn's linear regression fits this approach. It's always useful to have the documentation page for the method open so you can see your parameter options. In this case, we refer to the documentation for linear regression.
We will begin by generating some simulated data using the statistical model $$ y=m_1x_1+m_2x_2+e $$ where $e$ is a normal variable with mean zero and standard deviation 5.
import numpy as np
from bokeh.plotting import figure
from bokeh.io import output_notebook, show
output_notebook()
# draw 20 samples of each feature uniformly from [-5, 5]
x1 = np.random.uniform(-5, 5, size=20)
x2 = np.random.uniform(-5, 5, size=20)
# sklearn expects the features as columns of a (samples, features) matrix
X = np.stack([x1, x2], axis=1)
print('Stacking x1 and x2 along axis=1 creates a matrix of shape ', X.shape)
# y = 3*x1 - 5*x2 plus normal noise with mean 0 and standard deviation 5
y = 3*x1 - 5*x2 + np.random.normal(0, 5, size=20)
Using Google, we find that linear regression is part of the linear_model submodule of the vast sklearn library, so we import the LinearRegression class from that submodule.
from sklearn.linear_model import LinearRegression
Here we create a default linear regression object. Linear regression is a simple algorithm, and there are only a few options to this construction; for example, we could pass fit_intercept=False if our data were centered and we didn't want the algorithm to add a column of ones to compute the intercept. In our case, we will leave the defaults.
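For instance, a no-intercept version would be constructed like this (shown only for illustration; we won't use it below, and the name L0 is our own):

L0 = LinearRegression(fit_intercept=False)  # forces the fitted plane through the origin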
L = LinearRegression()
The object L has a fit method that computes the parameters of the model -- the slopes and intercept. It takes the data X and the target variable y.
A = L.fit(X,y)
From the object A, we can extract the attributes, or fitted parameters, of the model. (Note that fit returns the fitted estimator itself, so A is just another name for L.) The coef_ attribute holds the coefficients ($m_1$ and $m_2$), and the intercept_ attribute is the intercept.
A.coef_
A.intercept_
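Since we simulated the data ourselves, we can sanity-check the fit: the coefficients should come out near the true values 3 and -5, and the intercept near 0, up to the random noise (the exact numbers vary from run to run):

print('coefficients:', A.coef_)    # roughly [ 3., -5.]
print('intercept:', A.intercept_)  # roughly 0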
Now suppose we want to evaluate our linear model on some data. For example, we want to compute the predicted values yhat from our initial data. The predict method does this.
yhat = L.predict(X)
# plot the predicted values against the true values
f = figure(title='y vs yhat')
f.scatter(x=y, y=yhat)
show(f)
The linear regression object in sklearn offers a score method that returns the coefficient of determination of the model. This is $$ R^2 = 1 - \frac{\mathrm{MSE}}{\sigma_y^2}, $$ so it compares the mean squared error to the inherent variance of the y values.
L.score(X,y)
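We can verify this formula by hand, reusing the yhat computed above (np.var uses the same 1/n normalization as the mean squared error, so the two expressions agree exactly):

mse = np.mean((y - yhat)**2)
print(1 - mse/np.var(y))   # matches L.score(X, y)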
In the discussion of the Naive Bayes method, we are faced with the problem of converting text into a vector of word counts. Scikit-learn includes a tool for this, which we will now illustrate: the CountVectorizer class.
It's a good idea to refer to the relevant documentation here.
We also need some data; according to the documentation, the CountVectorizer object can take its input from filenames, from open file objects, or from a sequence of strings.
For simplicity we'll work with a few basic sentences. To get an apostrophe into a string, I escape it and write it as \'.
data = ['She\'s got diamonds on the soles of her shoes','Lucy in the sky with diamonds',
'Blue suede shoes','Oceans of diamonds']
from sklearn.feature_extraction.text import CountVectorizer
The CountVectorizer object has many options. Usually the defaults are fine, but let's take note of a few interesting ones:

- Setting max_features to n limits the code to finding the n most common words in the text.
- stop_words='english' tells the code to ignore a list of common English words like "I", "at", and so on. You can also supply your own list of stop words as an option.
- The min_df and max_df flags allow you to screen out words whose document frequency is below or above a threshold; for example, if you have 100 documents you can ignore words that occur in fewer than 20 of them.
- The binary=True option tells the object only to determine whether a word is present, rather than to count the occurrences, so it sets all non-zero counts to 1.

The vectorizer can also be given an explicit vocabulary to work with, can be told to look at character sequences instead of words, and can be asked to consider n-grams (groups of n consecutive words).
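As a quick sketch (not used in what follows), a vectorizer with a couple of these options might be constructed like this:

V_opts = CountVectorizer(stop_words='english', max_features=5)  # drop English stop words; keep the 5 most common remaining words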
We'll just use the default version here.
V = CountVectorizer()
We fit the data, creating an object A (as before, fit returns the fitted object itself). From that object, we can extract the words the vectorizer found.
A = V.fit(data)
A.vocabulary_ is a Python dictionary that associates each word in the vocabulary with a number. That number is a column index, so that when we create a matrix of documents by word counts, we can associate columns with keywords.
A.vocabulary_
The get_feature_names() method returns a list of the vocabulary words in index order. (In more recent versions of scikit-learn, this method has been renamed get_feature_names_out().)
A.get_feature_names()
Now we can use the fitted object to convert text to a matrix. The result is a sparse matrix, which is a way of storing the data efficiently since these matrices can be quite large. The toarray method gives us a numpy array.
X = A.transform(data)
X
X.toarray()
In the matrix above, each row corresponds to one of our original sentences, and each column to a keyword. Referring back to the vocabulary index, we see that 'of' is column 7, and looking at column 7 of the matrix we see that 'of' occurs in the first and last sentences.
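We can check this directly by slicing out column 7 of the dense matrix:

X.toarray()[:, 7]   # -> array([1, 0, 0, 1]): 'of' appears in sentences 1 and 4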
If we want the total occurrences of each word in the data, we can sum by columns.
X.sum(axis=0)
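To see which total belongs to which word, we can pair the column sums with the feature names (a small convenience sketch; np.asarray flattens the 1-by-n matrix that the sparse sum returns):

counts = np.asarray(X.sum(axis=0)).ravel()
dict(zip(A.get_feature_names(), counts))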
Finally, we can compute word occurrences using the derived vocabulary by calling transform on new data.
Y=A.transform(['Here is a new sentence'])
Y.toarray()
Notice that Y is all zeros -- that's because none of the words in that sentence are in the vocabulary that the object computed from the original data.
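By contrast, a sentence that does share words with the original data produces nonzero counts in the corresponding columns:

Z = A.transform(['diamonds and shoes'])
Z.toarray()   # 1 in the 'diamonds' (column 1) and 'shoes' (column 10) positions, 0 elsewhere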