Introductory Tutorial
10 Nov 2014
Note: Previously there was an error in the label generation step. The spam examples were labeled ‘1’ and the ham examples ‘0’. This has now been corrected.
Prerequisites:
- Scikit-learn
- the [Tutorial kit](https://github.com/Winterflower/mockmail-intro/releases/tag/v1.0)
- Download the Tutorial kit and unzip it
- Change to the root of the Tutorial kit directory
- Open the file introductory_tutorial.py using your favorite text editor (I quite like Atom, but you are welcome to use anything you like)
##1. Importing required modules
For this tutorial, we will require three modules:
- numpy
- sklearn.naive_bayes
- text_adapter
Import these into your Python script.
import numpy as np #too lazy to type numpy every time
import text_adapter
from sklearn.naive_bayes import BernoulliNB
In the last statement we choose to import only
the class BernoulliNB, because we will not be
needing the other sklearn.naive_bayes
classes.
##2. Preprocessing the training data for our classifier
The training data (the HAM and SPAM emails) have been provided for you in the script.
spam_emails=["Hello send your password", "hello please click link", "click link",
"your password here", "send password"]
ham_emails=["hello reset your password", "password email", "warm hello" ]
Now, our goal is to process the data into a format accepted by the BernoulliNB class.
If we navigate to the [documentation for BernoulliNB](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html), we can see that the fit method accepts two arguments: a Numpy matrix in which each sample is represented by a row, and a Numpy array of class labels (let’s denote SPAM by 0 and HAM by 1).
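To make the expected shapes concrete, here is a tiny made-up example (not our tutorial data): two samples, three word features, and one class label per sample.
#toy illustration of the shapes fit() expects: one row per email,
#one binary column per word, and one label per row
X_example=np.array([[1, 0, 1],
                    [0, 1, 1]])
y_example=np.array([0, 1]) #0 denotes SPAM, 1 denotes HAM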
- Concatenate the SPAM and HAM emails into one list for easy processing
training_data=spam_emails+ham_emails
- Create a one-dimensional Numpy array of class labels
We have to assign a class label (0 or 1) to each email in the training_data
list we created above. An easy way to do this is to use a mixture
of repetition and list concatenation. An example is shown below, but
you are welcome to come up with your own.
training_labels=np.array([0]*len(spam_emails)+[1]*len(ham_emails)) #spam emails come first in training_data, so the 0 (SPAM) labels come first
- Create a Numpy feature matrix
To save time, you can create the feature matrix using the create_scikit_matrix method in the text_adapter module. Unless you used a different import statement, do not forget to tell Python that the method create_scikit_matrix lives in the text_adapter module. Do this by calling the method as text_adapter.create_scikit_matrix.
training_data_matrix=text_adapter.create_scikit_matrix(training_data)
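As an optional cross-check (not part of the tutorial kit), scikit-learn's own CountVectorizer with binary=True builds an equivalent term-presence matrix; note that its column ordering may differ from the one produced by text_adapter.
#optional cross-check using scikit-learn's built-in text vectorizer
#binary=True marks each word as present (1) or absent (0) in each email
from sklearn.feature_extraction.text import CountVectorizer
vectorizer=CountVectorizer(binary=True)
alternative_matrix=vectorizer.fit_transform(training_data).toarray()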
##3. Creating and training our classifier
Now comes the really fun part! So far, we have preprocessed our data and we have a Numpy matrix with our feature vectors and a Numpy array with class labels. We are ready to do some machine learning!
First, let’s use the Scikit-learn API to create an instance
of the BernoulliNB class. For future use, we will store
this object in a variable called classifier.
classifier=BernoulliNB()
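As an aside, BernoulliNB accepts a few optional parameters; the most commonly adjusted one is the smoothing strength alpha, which defaults to 1.0 (Laplace smoothing). The default is fine for this tutorial, so the line below is equivalent to the one above.
#equivalent to the default constructor; alpha controls additive smoothing
classifier=BernoulliNB(alpha=1.0)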
Next, train the classifier by using our training data and training class labels.
classifier.fit(training_data_matrix, training_labels)
Hooray!
Test your classifier on a new incoming email.
test_email="hello please send password"
To compute the posterior probability that this email is spam, we first have to convert it to a feature vector. As a reminder, we are using the words that occur in the training data set as our features. If a word is present in the email, we give it a value of 1; otherwise, the value is 0.
#compute the dictionary of words based on the training set we defined above
training_dictionary=text_adapter.dictionary_builder(training_data)
test_feature_vector=np.array(text_adapter.binary_feature_vector_builder(training_dictionary,test_email))
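If you want to see what such a binary feature vector looks like without the helper, here is a hand-rolled sketch. The vocabulary order below is made up for illustration; the real column order comes from text_adapter's dictionary_builder.
#hand-rolled sketch of the presence/absence encoding described above
#this vocabulary order is hypothetical; text_adapter defines the real one
vocabulary=["click", "hello", "link", "password", "send"]
email_words=set(test_email.split())
sketch_vector=np.array([1 if word in email_words else 0 for word in vocabulary])
#for our test email this marks hello, password and send as present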
In Scikit-learn, the probability of a data point belonging to a certain
class (in our case HAM or SPAM) can be calculated by calling
the method predict_proba()
and passing the feature vector as an argument.
print classifier.predict_proba(test_feature_vector)
If you execute the statement above, you will notice that it returns two probabilities. The first probability is for class 0 (SPAM) and the second for class 1 (HAM).
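If you just want the predicted class rather than the probabilities, BernoulliNB also provides a predict method. Note that newer scikit-learn versions expect a 2D array with one row per sample, so you may need to reshape the feature vector first.
#predicted class label directly: 0 means SPAM, 1 means HAM in our labeling
print classifier.predict(test_feature_vector)
#on newer scikit-learn versions, reshape to a single-row 2D array first:
#classifier.predict(test_feature_vector.reshape(1, -1))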