Hello and welcome to thoughtwisps! This is a personal collection of notes and thoughts on software engineering, machine
learning and the technology industry and community. For my professional website, please see
race-conditions.
Thank you for visiting!
10 Nov 2014
Note: Previously there was an error in the label generation step. The spam examples were
labeled ‘1’ and the ham examples ‘0’. This has now been corrected.
Prerequisites:
- Scikit-learn
- the [Tutorial kit](https://github.com/Winterflower/mockmail-intro/releases/tag/v1.0)
- Download the Tutorial kit and unzip it
- Change to the root of the Tutorial kit directory
- Open the file introductory_tutorial.py using your favorite text editor
(I quite like Atom, but you are welcome to use anything you like)
##1. Importing required modules
For this tutorial, we will require three modules
- numpy
- sklearn.naive_bayes
- text_adapter
Import these into your Python script.
import numpy as np #too lazy to type numpy every time
import text_adapter
from sklearn.naive_bayes import BernoulliNB
In the last statement we choose to import only
the class BernoulliNB, because we will not be
needing the other sklearn.naive_bayes
classes.
##2. Preprocessing the training data for our classifier
The training data (the HAM and SPAM emails) have been
provided for you in the script.
spam_emails=["Hello send your password", "hello please click link", "click link",
"your password here", "send password"]
ham_emails=["hello reset your password", "password email", "warm hello" ]
Now, our goal is to
process the data into a format accepted by the BernoulliNB
class.
If we navigate to the [documentation for BernoulliNB](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html),
we can see that the fit function accepts two arguments: a Numpy matrix in which
each sample is represented by a row, and a Numpy array of class labels (let’s
denote SPAM by 0 and HAM by 1).
- Concatenate the SPAM and HAM emails into one list for easy processing
training_data=spam_emails+ham_emails
- Create a one-dimensional Numpy array of class labels
We have to assign a class label (0 or 1) to each email in the training_data
list we created above. The labels must appear in the same order as the emails
in training_data, so the SPAM labels (0) come first, followed by the HAM labels (1).
An easy way to do this is to use a mixture
of repetition and list concatenation. An example is shown below, but
you are welcome to come up with your own.
training_labels=np.array([0]*len(spam_emails)+[1]*len(ham_emails))
- Create a Numpy feature matrix
To save time, you can create the feature matrix using the
create_scikit_matrix method in the text_adapter module.
Unless you used a different import statement, do
not forget to tell Python that the method create_scikit_matrix
lives in the text_adapter module.
Do this by calling the method as text_adapter.create_scikit_matrix.
training_data_matrix=text_adapter.create_scikit_matrix(training_data)
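To make the target format concrete, here is a minimal sketch of how a binary feature matrix and a matching list of labels could be built by hand. It does not use text_adapter, whose internals are not shown in this post: build_vocabulary and to_binary_vector are hypothetical helper names, and lowercasing plus splitting on whitespace is an assumed tokenization. Plain lists are used so that the sketch runs without extra packages; wrapping them in np.array(...) would give the Numpy objects that fit expects.

```python
# Hypothetical stand-in for text_adapter: helper names and tokenization are assumptions.
spam_emails = ["Hello send your password", "hello please click link", "click link",
               "your password here", "send password"]
ham_emails = ["hello reset your password", "password email", "warm hello"]
training_data = spam_emails + ham_emails

def build_vocabulary(emails):
    """Collect every distinct lowercased word, in order of first appearance."""
    vocabulary = []
    for email in emails:
        for word in email.lower().split():
            if word not in vocabulary:
                vocabulary.append(word)
    return vocabulary

def to_binary_vector(vocabulary, email):
    """Return 1 for each vocabulary word that occurs in the email, otherwise 0."""
    words = set(email.lower().split())
    return [1 if word in words else 0 for word in vocabulary]

vocabulary = build_vocabulary(training_data)
feature_rows = [to_binary_vector(vocabulary, email) for email in training_data]

# Labels follow the order of training_data: SPAM (0) first, then HAM (1).
labels = [0] * len(spam_emails) + [1] * len(ham_emails)

print(vocabulary)
print(feature_rows[0])
print(labels)
```

Each row has one entry per vocabulary word, and there is exactly one label per row.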
##3. Creating and training our classifier
Now comes the really fun part! So far, we have
preprocessed our data and we have a Numpy matrix with our
feature_vectors and a Numpy array with class labels. We are ready to
do some machine learning!
First, let’s use the Scikit-learn API to create an instance
of the BernoulliNB class. For future use, we will store
this object in a variable called classifier.
classifier=BernoulliNB()
Next, train the classifier by using our training data and training class labels.
classifier.fit(training_data_matrix, training_labels)
Hooray!
Test your classifier on a new incoming email
test_email="hello please send password"
To compute the posterior probability that this email is spam,
we have to first convert it to a feature vector. As a reminder,
we are using the words that occur in the training data set as our features.
If the word is present in the email, we give it a value of 1. Else
the value is 0.
#compute the dictionary of words based on the training set we defined above
training_dictionary=text_adapter.dictionary_builder(training_data)
test_feature_vector=np.array(text_adapter.binary_feature_vector_builder(training_dictionary,test_email))
In Scikit-learn, the probability of a datapoint belonging to a certain
class (in our case HAM or SPAM) can be calculated by calling
the method predict_proba() and passing the feature vector as an argument.
print classifier.predict_proba(test_feature_vector)
If you execute the statement above, you will notice that it returns
two probabilities. The first probability is for class 0 (SPAM) and
the second for class 1 (HAM).
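For readers curious about what happens inside predict_proba, the sketch below computes the same kind of posterior by hand in plain Python. It is an illustration, not Scikit-learn's actual code. It assumes add-one (Laplace) smoothing and class priors estimated from the class frequencies, which to my knowledge correspond to the BernoulliNB defaults (alpha=1.0, fit_prior=True). The helper names and the lowercase/whitespace tokenization are my own assumptions.

```python
# Hand-rolled Bernoulli Naive Bayes sketch; helper names and tokenization are assumptions.
spam_emails = ["Hello send your password", "hello please click link", "click link",
               "your password here", "send password"]
ham_emails = ["hello reset your password", "password email", "warm hello"]
test_email = "hello please send password"

def words_of(email):
    """Set of lowercased words in an email."""
    return set(email.lower().split())

vocabulary = sorted(set(word for email in spam_emails + ham_emails
                        for word in words_of(email)))

def word_probabilities(emails):
    """Smoothed probability that each vocabulary word appears in an email of this class."""
    probabilities = {}
    for word in vocabulary:
        count = sum(1 for email in emails if word in words_of(email))
        probabilities[word] = (count + 1.0) / (len(emails) + 2.0)
    return probabilities

def joint_score(emails, total_emails, email):
    """Class prior multiplied by the Bernoulli likelihood of the email."""
    score = len(emails) / float(total_emails)
    present = words_of(email)
    for word, probability in word_probabilities(emails).items():
        score *= probability if word in present else (1.0 - probability)
    return score

total = len(spam_emails) + len(ham_emails)
spam_score = joint_score(spam_emails, total, test_email)
ham_score = joint_score(ham_emails, total, test_email)

# Normalize so the two values sum to 1, in the order [SPAM, HAM].
posterior = [spam_score / (spam_score + ham_score),
             ham_score / (spam_score + ham_score)]
print(posterior)
```

With these eight training emails the sketch assigns the larger probability to SPAM for the test email.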
02 Nov 2014
Version 1.0 (last updated on Nov 8th, 2014)
Welcome back to the second part of the quick Python overview.
In this section, we will cover the basics of if-else
statements, while and for loops and creating methods. Once again, all
comments and questions are encouraged and very welcome.
Please feel free to email me at camillamon[at]gmail.com.
##2.0 Control-flow in Python programs
If-else statements and for and while loops
form the core of control-flow in Python programs.
Let’s take a brief tour of these structures, starting
with the if-else statement.
###2.1 If and else
Typically in a more complicated program, we have
to take different actions depending on some previous result.
For example, suppose we have a number and we want
to check whether it is divisible by 5.
One very simple way to do this is to use the if-statement.
The syntax of the if-statement in Python is as follows:
if test:
    do something
else:
    do something else
Here is a small example illustrating the concept:
random_number=10
if random_number%5==0:
    print "Divisible by 5"
else:
    print "Not divisible by 5"
Let’s step through the example above line by line.
In the first line, we create the number object (‘10’) and
give it the name random_number. Next, we want
to find out whether the object that the name random_number
points to is divisible by 5 or not. In order to do this,
we employ the modulus operation, which gives us the remainder
of random_number divided by 5. If the remainder is 0 (i.e. random_number
is divisible by 5), then we will print out “Divisible by 5”.
Otherwise, we will print out “Not divisible by 5”.
If you want, run this program using the interactive Python shell
or the online Python shell.
Which one of the statements is printed to the terminal?
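To get a feel for the modulus operator used in the test, here is a small sketch that computes the remainder for a few numbers. The numbers and variable names are made up for illustration, and print(...) is written with parentheses so that the snippet also runs under Python 3.

```python
# Illustrative values only; try your own numbers as well.
remainder_of_10 = 10 % 5   # 10 = 2*5 + 0, so 10 is divisible by 5
remainder_of_12 = 12 % 5   # 12 = 2*5 + 2, so 12 is not divisible by 5
remainder_of_7 = 7 % 5     # 7 = 1*5 + 2, so 7 is not divisible by 5

print(remainder_of_10)
print(remainder_of_12)
print(remainder_of_7)
```

A remainder of 0 is exactly the condition that the if-statement above tests for.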
####2.1.1 A more complicated example
In the little program above, we only tested for one condition (whether
or not the number we stored in random_number is divisible by 5).
Usually in real programming life, testing for one condition
is not enough for what we aim to achieve. So let’s take
our little example one step further.
Suppose that we still want to check whether the number
stored in random_number is divisible by 5. If not, we
want to check whether it is divisible by 3. This
means that our program has to branch into three different
‘logical’ branches:
- random_number is divisible by 5
- random_number is not divisible by 5 but is divisible by 3
- random_number is neither divisible by 5 nor divisible by 3
In Python, this could be achieved in the following manner:
random_number=10 #or assign a number of your choice
if random_number%5==0:
    print "The number is divisible by 5!"
elif random_number%3==0:
    print "The number is divisible by 3!"
else:
    print "The number is divisible neither by 5 nor 3"
###Exercises:
- Find out if the year 1044 is a leap year.
A year is a leap year if it is divisible by 4, unless it is also
divisible by 100, in which case it must be divisible by 400 as well.
###2.2 The while loop
Now that we are familiar with if, else and elif
statements, we can take a look at the while loop.
The while loop executes as long as some condition is true and
is especially useful if we want to execute a block of code repeatedly.
Let’s illustrate this with a simple example.
Suppose we want to print out all of the numbers from 1 to 10.
number=1 #the initial number
while number<11:
    print number
    number+=1
As we can see from the example above, a while
statement is declared with the following syntax:
while test:
    do something
The loop will keep executing until the test
becomes false.
In the little number printing example, the test in the while
statement evaluates whether the number object referenced
by number is less than 11. If yes, the statements
inside the while block are executed.
####A Word of Warning: Do not write infinite loops!
When writing your first while loops, it’s easy to forget
to make sure that the loop terminates.
What would happen if we leave out the statement number+=1
from the while loop we wrote above?
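Without number+=1 the test number<11 never becomes false, so the loop would keep printing 1 forever. The sketch below shows this safely by capping the number of iterations. The names iterations and max_iterations are my own additions for the demonstration and are not part of the original example.

```python
number = 1
iterations = 0        # safety counter, added only for this demonstration
max_iterations = 5    # arbitrary cap so that the demonstration terminates

while number < 11 and iterations < max_iterations:
    print(number)     # number is never increased, so this always prints 1
    iterations += 1

print("Stopped after %d iterations, number is still %d" % (iterations, number))
```

Remove the cap and the loop never ends, which is why every while loop needs a statement that eventually makes its test false.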
###Exercises:
- Write a small program that checks the numbers from 1 to 25 and
prints only those that are divisible by 5
###2.3 For loops
The for loop is a close cousin of the while loop.
It is designed to iterate (or step through) items, for example
in a list or string. Let’s look at the general syntax of the for loop.
#general syntax for a for-loop
for element in object:
    execute code here
The for loop begins with a
header similar to that of the while loop.
There is one key difference, though. The header for the for-loop
also includes something called an assignment target, which we called
element in the script above.
You can think of the assignment target as a box.
When we loop through an iterable object such as a list,
every element takes a turn jumping into the box. While
the element ‘lives in the box’, we can carry out operations on it.
If all of this seems nebulous right now, do not worry!
We will make all of this concrete by working through
a for loop example.
We are given a list of elements (these may be strings
or numbers or a mixture of both) and we have to print out
each element.
random_elements=["apple", "Jack", 12, 1+8, "athlete"]
#let's print out each element using the for-loop
for element in random_elements:
    print element
"""
#The output should be :
apple
Jack
12
9
athlete
"""
It does not matter what we call the assignment target.
Instead of element, we could have called it word or
chocolatebar or even simply x.
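Since for loops also work on strings, here is a short sketch that steps through a string one character at a time. The string and the names letter and vowel_count are arbitrary choices for illustration.

```python
word = "athlete"
vowel_count = 0

# Each character of the string takes its turn as the assignment target.
for letter in word:
    print(letter)
    if letter in "aeiou":
        vowel_count += 1

print("Number of vowels: %d" % vowel_count)
```

The loop body is the same kind of code as before; only the object being iterated over has changed from a list to a string.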
Now it’s time to write your own for-loop.
###Exercises:
- Searching for a string in a list:
You are given the following list (copy and paste this
into your Python shell or text editor)
strings=['absb', 'hello', 'hghgtjk', 'apples', 'icecream']
Use a for-loop to iterate through the list. When you
encounter the string “apples”, print out the words
“I found apples” on the console.
01 Nov 2014
Hello and welcome to part 1 of the quick Python overview! This
is a very basic tutorial that will quickly allow you to learn enough Python to
attend the Intro to ML with Scikit-learn workshop. For the purpose
of this tutorial, you do not have to install Python. You can do all of the exercises
in the Online Python shell.
If you run into any trouble or you find that a concept is wrong or
poorly explained, please do not hesitate to contact me
at camillamon[at]gmail.com. I’ll try to get back to you as soon as I can!
There is also a list of alternative Python resources at the end of this
tutorial.
##1. Python and Objects
One of the cornerstones of machine learning is (you guessed it!) manipulating
data. In Python, data is manipulated using objects such as numbers, strings,
lists and dictionaries. Some of these objects are built-in, others come
from external libraries. You can also define your own objects.
###1.0 Examples of built-in Python Objects
#numbers
15
1234.9
#strings
"hello world"
#lists
my_numbers=[1,4,5,6]
nested_list=[1,[1,2]]
empty_list=[]
#tuples
simple_tuple=(9,10)
#dicts
my_empty_dictionary={}
my_nonempty_dictionary={'song':"Etude 9",
'duration':10}
###1.1 Assigning names to objects
Suppose we have a program that prints “hello world”
several times. We could simply type “hello world”
every time we want to print it.
print "hello world"
#do something else
print "hello world"
#do something else
print "hello world"
Instead of typing “hello world” every time, we can assign
a name to “hello world”. In Python, assigning a name to an
object is done using the “=” symbol. We can then
rewrite the little program from above as follows
hello_string="hello world"
print hello_string
#do something else
print hello_string
#do something else
print hello_string
Giving an object a name and then referring to the object by a name will
make it very easy for us to change the program. Suppose that instead
of “hello world”, we want to print “hello everyone”. In the
first version of this program, we would have to change
every instance of “hello world” (tedious and you might miss some and
break your program), but in the second version we have to change the
string only once.
You can assign names
to objects of any type (number, list, string, etc.).
#basic examples with strings, numbers and dicts
title="Do Androids Dream of Electric Sheep"
year_of_publication=1968
my_favorite_books={'Donna Tartt':'The Goldfinch',
'Sylvia Plath':'The Bell Jar'}
####A Brief Note on Variable Names
It is good practice to give your variables descriptive names. This
will save you from having a ‘WTF is this thing here’ moment
when you come back to your code several weeks or months later.
Python is quite relaxed about variable naming rules, but
there are a few NO-NOs:
- Do not start a variable name with a number
- Do not use one of Python’s reserved keywords as a variable name
- Do not use symbols such as ‘@’ in variable names
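These rules can be checked mechanically. The sketch below uses the standard keyword module together with a regular expression. is_valid_name is a hypothetical helper written for this post, and the pattern only covers plain ASCII names made of letters, digits and underscores.

```python
import keyword
import re

# Assumed pattern: ASCII letters, digits and underscores, not starting with a digit.
NAME_PATTERN = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*\Z")

def is_valid_name(name):
    """True if name looks like a legal variable name and is not a reserved keyword."""
    return bool(NAME_PATTERN.match(name)) and not keyword.iskeyword(name)

for candidate in ["my_favorite_numbers", "2fast", "while", "user@home"]:
    print("%s -> %s" % (candidate, is_valid_name(candidate)))
```

Only the first candidate passes: the second starts with a digit, the third is a reserved keyword and the fourth contains the symbol ‘@’.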
###Exercises:
- Create a string object with the name of your favorite novel and assign
it to a variable with a descriptive name
- Create a list with some of your favorite numbers and give it the name
my_favorite_numbers
- Create a dictionary using whatever keys and values you prefer
###1.3 String operations
In the machine learning workshop, we will manipulate textual data.
Thus it is a good idea to go over some very basic string operations, which
are built into Python.
#how do I create a string?
dna="ACGTGTCGTGTGTGTG"
#how do I find out the length of a string?
len(dna)
#16
#how do I obtain the first letter of a string (or nucleotide for the biologists among us!)
dna[0] #note indexing starts at zero
#'A'
#how do I obtain the last letter of a string?
dna[-1]
#'G'
#or
dna[len(dna)-1] #QUIZ: would the command dna[len(dna)] work? Why or why not?
#'G'
# how do I obtain a substring? (also known as slicing)
dna[0:3]
#'ACG'
#QUIZ: what does the operation dna[-2] return?
###Exercises:
- Create a string in the Online Python interpreter (or your own interactive
Python session) and give it a descriptive name (e.g. my_string for
those of us lacking imagination :D)
- Obtain the first 4 characters of my_string
- Find out the length of my_string
- Create a string object with the value “hello world” by concatenating
two string objects, “hello” and “world”.
###1.4 Basic list operations and methods
Lists allow us to store a group of related objects (strings, numbers etc)
together.
#how do I create a list?
favorite_ice_cream=["vanilla", "chocolate", "strawberry"]
prices=[12, 56, 78]
#how do I access an element in the list?
favorite_ice_cream[0]
#'vanilla'
#how do I find out the number of elements in a list?
len(favorite_ice_cream)
#3
#how do I concatenate two lists?
favorite_ice_cream+prices
#['vanilla', 'chocolate', 'strawberry', 12, 56, 78]
# as we can see from above,
#Python allows you to have lists with objects of different types (ie. numbers and strings)
In addition to the basic operations illustrated above, lists come
with several predefined methods.
#how do I add an element to the end of a list?
favorite_ice_cream.append("cherry garcia")
#how do I delete an item at position n (where n is the index of the element you want to delete)?
favorite_ice_cream.pop(0)
favorite_ice_cream
#['chocolate', 'strawberry', 'cherry garcia']
#the object 'vanilla' was deleted from the list
###Exercises:
- Create a list with your favorite desserts and give it a meaningful name
- Print the length of my_random_list given by the expression
my_random_list=range(1,10)
Find out more about the range() function by navigating to the Python
docs page.
###1.5 Basic dictionary operations and methods
Sometimes we want to associate particular keys with values.
For example, a company may want to store some basic information
about its employees.
#how do I create a dict object?
employee={
'name':'Jane Doe',
'department':'engineering',
'salary':300000,
}
#how do I access the value associated with a key?
employee['name']
#'Jane Doe'
#how do I obtain a list of all keys?
employee.keys()
#['department', 'salary', 'name']
#how do I add another key-value pair?
employee['programming_language']="Python"
#check that the new key has been added by printing the keys
employee.keys()
#['department', 'salary', 'programming_language', 'name']
#note that the keys may be returned in a different order
###Exercises:
- Create a dict object for employee with the name “John Smith”
- Populate it with key-value pairs of your choice
##Other resources for learning Python