# The scikit-learn library

The scikit-learn library is a large open-source Python package that implements a wide range of machine learning algorithms. It has a consistent structure, and once you understand that structure you can use the documentation to learn how to apply nearly any of its algorithms. In this overview we will look at this structure for linear regression and the Naive Bayes model.

## General Overview

Typically, to apply the scikit-learn library to a machine learning problem, you

1. construct an object that represents the algorithm you plan to apply (such as linear regression), setting whatever parameters you choose to control the details of the algorithm.

2. fit the object to your training data using its fit method, which computes the parameters of the model from the data.

3. transform or predict from other data to compute the predicted values based on the fitted model.

4. score your results or evaluate them in some way.
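The four steps above can be sketched in a few lines. This is a minimal illustration on made-up toy data that lies exactly on the line y = 2x + 1; the variable names (`model`, `X_train`, `y_train`) are assumptions chosen for the example, not part of the library.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data lying exactly on the line y = 2x + 1 (made up for illustration).
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([1.0, 3.0, 5.0, 7.0])

model = LinearRegression()                   # 1. construct the estimator
model.fit(X_train, y_train)                  # 2. fit it to the training data
y_new = model.predict(np.array([[4.0]]))     # 3. predict on other data
r2 = model.score(X_train, y_train)           # 4. score the result

print(y_new, r2)
```

Because the toy data has no noise, the prediction at x = 4 is 9 (up to floating-point error) and the score is 1.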

## Illustration using Linear Regression

Let's look at how sklearn's linear regression fits this approach. It's always useful to have the documentation page for the method open so you can see your parameter options. In this case, we refer to the documentation for linear regression.

We will begin by generating some simulated data using the statistical model $$y=m_1x_1+m_2x_2+e$$ where $e$ is a normal variable with mean zero and standard deviation 5.
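One way to simulate data from this model with NumPy is sketched below. The sample size, the slopes `m1` and `m2`, the range of the x values, and the random seed are all assumed values picked for illustration; only the noise standard deviation of 5 comes from the text.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the sketch is reproducible

N = 200                # number of samples (assumed)
m1, m2 = 2.0, -3.0     # slopes (assumed values)

X = rng.uniform(0, 10, size=(N, 2))            # columns are x1 and x2
e = rng.normal(loc=0.0, scale=5.0, size=N)     # noise: mean 0, standard deviation 5
y = m1 * X[:, 0] + m2 * X[:, 1] + e

print(X.shape, y.shape)
```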

Using Google, we find that linear regression is part of the linear_model submodule of the sklearn library, so we import that module.

### Step 1. Creating the Linear Regression object

Here we create a default linear regression object. Linear regression is a simple algorithm and there are only a few options to this construction; for example, we could say fit_intercept=False if our data were centered and we didn't want the algorithm to compute an intercept. In our case, we will leave the defaults.
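A minimal sketch of this step follows. The name `L` matches the object referred to in the text; `L_no_intercept` is a hypothetical name used only to show the alternative.

```python
from sklearn import linear_model

# Default linear regression object; fit_intercept defaults to True.
L = linear_model.LinearRegression()
print(L.get_params())

# Alternative mentioned in the text: do not fit an intercept term.
L_no_intercept = linear_model.LinearRegression(fit_intercept=False)
```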

### Step 2. Fitting the data

The object L has a fit method that computes the parameters of the model -- the slopes and intercept. It takes the data X and the target variable y.

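The fit method returns the fitted object, which we call A (for scikit-learn estimators this is the same object as L). From A we can extract the fitted parameters of the model: the coef_ attribute holds the coefficients (M) and the intercept_ attribute holds the intercept.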
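The fitting step might look like the sketch below. The simulated data (slopes 2 and -3, 200 samples, fixed seed) is an assumption made so the example runs on its own; the names `L` and `A` follow the text.

```python
import numpy as np
from sklearn import linear_model

# Simulated data with assumed slopes 2 and -3 and noise of standard deviation 5.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(0.0, 5.0, size=200)

L = linear_model.LinearRegression()
A = L.fit(X, y)        # fit returns the estimator itself

print(A is L)          # True: A and L are the same object
print(A.coef_)         # estimated slopes, one per column of X
print(A.intercept_)    # estimated intercept
```

The estimated slopes should land near the values used to simulate the data, with some error caused by the noise term.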

### Step 3. Making predictions

Now suppose we want to evaluate our linear model on some data. For example, we want to compute the predicted values yhat from our initial data. The predict method does this.
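As a sketch, the prediction step can be checked against the fitted formula directly. The simulated data below is again an assumed stand-in so the block is self-contained.

```python
import numpy as np
from sklearn import linear_model

# Simulated stand-in data (assumed slopes 2 and -3, noise sd 5).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(0.0, 5.0, size=200)

A = linear_model.LinearRegression().fit(X, y)

yhat = A.predict(X)                       # predicted values for the initial data
manual = X @ A.coef_ + A.intercept_       # the same computation written out by hand

print(yhat[:5])
print(np.allclose(yhat, manual))
```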

### Step 4. Scoring

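The linear regression object in sklearn offers a score method that returns the coefficient of determination $R^2$ of the model. This is $$1-\mathrm{MSE}/\sigma_{y}^2,$$ where $\mathrm{MSE}$ is the mean squared error of the predictions and $\sigma_{y}^2$ is the variance of the y values, so it compares the mean squared error to the inherent variance in the y values.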
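The formula can be verified numerically, as in the sketch below on assumed simulated data. Note that `np.var` computes the population variance (dividing by N) by default, which is the convention that makes the two numbers agree.

```python
import numpy as np
from sklearn import linear_model

# Simulated stand-in data (assumed slopes 2 and -3, noise sd 5).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(0.0, 5.0, size=200)

A = linear_model.LinearRegression().fit(X, y)

r2 = A.score(X, y)                          # coefficient of determination
mse = np.mean((y - A.predict(X)) ** 2)      # mean squared error
r2_manual = 1.0 - mse / np.var(y)           # 1 - MSE / variance of y

print(r2, r2_manual)
```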

## Illustration of word counting

In the discussion of the "Naive Bayes" method, we are faced with the problem of converting text into a vector of word counts. Scikit-learn includes a tool for this, which we will now illustrate. It is called CountVectorizer.

It's a good idea to refer to the relevant documentation here.

We also need some data, and according to the documentation, the CountVectorizer object can take input from a filename, a file object, or a sequence of strings. For simplicity we'll work with a few basic sentences. To get an apostrophe into a string, I escape it and write it as \'.
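A small list of sentences could be written as follows. These particular sentences are stand-ins invented for illustration, not the ones used in the original notebook.

```python
# Stand-in sentences (assumed for illustration).
sentences = [
    'The quick brown fox jumps over the lazy dog',
    'It\'s a dog\'s life',       # apostrophes escaped inside single quotes
    "The fox isn't lazy",        # double quotes avoid the need to escape
]

print(sentences[1])
```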

### Step 1. Create the object

The CountVectorizer object has many options. Usually, the defaults are fine, but let's take note of a few interesting ones:

• Setting max_features to n limits the vocabulary to the n most common words in the text.
• Setting stop_words='english' tells the code to ignore a list of common words like "I", "at", and so on. You can also supply your own list of stop words as an option.
• The min_df and max_df options let you screen out words whose document frequency is below (min_df) or above (max_df) a threshold; for example, if you have 100 documents, setting min_df=20 ignores words that occur in fewer than 20 of them.
• The binary=True option tells the object to only determine if a word is present, rather than to count the occurrences; so it will set all non-zero counts to 1.

The vectorizer can also be given an explicit vocabulary to work with, can be told to look at character sequences instead of words, and can be asked to consider n-grams (groups of n consecutive words).
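To see what some of these options do, here is a rough sketch on two made-up documents. The helper `vocab` is a hypothetical convenience function written for this example; it is not part of scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two tiny made-up documents.
docs = ["the cat sat on the mat", "the dog sat on the log"]

def vocab(**options):
    """Fit a CountVectorizer with the given options; return its sorted vocabulary."""
    return sorted(CountVectorizer(**options).fit(docs).vocabulary_)

print(vocab())                        # default settings
print(vocab(stop_words="english"))    # common English words such as "the" are dropped
print(vocab(min_df=2))                # ['on', 'sat', 'the']: words found in both documents
print(vocab(max_features=1))          # ['the']: the single most frequent word
print(vocab(ngram_range=(1, 2)))      # single words plus two-word sequences

# binary=True records presence or absence instead of counts.
counts = CountVectorizer().fit_transform(docs).toarray()
binary_counts = CountVectorizer(binary=True).fit_transform(docs).toarray()
print(counts.max(), binary_counts.max())
```

With the default settings "the" is counted twice in each document, so the largest count is 2; with binary=True the largest entry is 1.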

We'll just use the default version here.
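Creating the default object is a one-liner. The sketch below also prints the default values of the options listed above; the name `V` is an assumption made for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

V = CountVectorizer()    # all options left at their defaults

# Inspect the defaults for the options discussed above.
params = V.get_params()
for name in ("max_features", "stop_words", "min_df", "max_df", "binary"):
    print(name, "=", params[name])
```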

### Step 2. Fit the data

We fit the data, creating an object A. From that object, we can extract the words the vectorizer found.

A.vocabulary_ is a Python dictionary that associates each word in the vocabulary with a number. That number is an index so that when we create a matrix of documents by word counts, we can associate columns to keywords.

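The get_feature_names_out() method returns an array of the vocabulary words in index order. (Older versions of scikit-learn call this method get_feature_names(); that name was deprecated in version 1.0 and removed in version 1.2.)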

### Step 3. Transforming data

Now we can use the fitted object to convert text to a matrix. The result is a sparse matrix, which is a way of storing the data efficiently since these matrices can be quite large. The toarray method gives us a NumPy array.

In the matrix above, each row corresponds to one of our original sentences, and each column to a keyword. Referring back to the vocabulary index, we see that 'of' is column 7, and looking at column 7 of the matrix we see that 'of' occurs in the first and last sentence.
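The transform step and the column lookup can be sketched as follows. The sentences here are stand-ins, so the column indices differ from the ones discussed in the text; the example looks up 'fox' rather than 'of'.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

# Stand-in sentences (assumed for illustration).
sentences = [
    'The quick brown fox jumps over the lazy dog',
    'It\'s a dog\'s life',
    "The fox isn't lazy",
]

A = CountVectorizer().fit(sentences)

M = A.transform(sentences)     # sparse matrix: one row per sentence
print(issparse(M), M.shape)

dense = M.toarray()            # ordinary NumPy array
print(dense)

# Look up a word's column through the vocabulary, then read that column.
col = A.vocabulary_["fox"]
print(col, dense[:, col])      # 'fox' appears in the first and last sentence
```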

If we want the total occurrences of each word in the data, we can sum by columns.

Finally, we can compute word occurrences using the derived vocabulary by calling transform on new data.

Notice that Y is all zeros -- that's because none of the words in that sentence are in the vocabulary that the object computed from the original data.
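This behavior is easy to reproduce. In the sketch below, both the training sentences and the new sentence are assumed examples; the new sentence shares no words with the fitted vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Stand-in sentences (assumed for illustration).
sentences = [
    'The quick brown fox jumps over the lazy dog',
    'It\'s a dog\'s life',
    "The fox isn't lazy",
]

A = CountVectorizer().fit(sentences)

# A new sentence whose words are all missing from the fitted vocabulary.
Y = A.transform(["Cats and mice play outside"])
print(Y.toarray())    # one row of zeros, one column per vocabulary word
```

Words that were not seen during fitting are ignored by transform, so the width of the matrix stays fixed at the size of the original vocabulary.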