thoughtwisps One commit at a time

Hello and welcome to thoughtwisps! This is a personal collection of notes and thoughts on software engineering, machine learning and the technology industry and community. For my professional website, please see race-conditions. Thank you for visiting!

Introductory Tutorial

Note: Previously there was an error in the label generation step. The spam examples were labeled ‘1’ and the ham examples ‘0’. This has now been corrected.

Prerequisites:

  • Scikit-learn
  • the [Tutorial kit](https://github.com/Winterflower/mockmail-intro/releases/tag/v1.0)
  1. Download the Tutorial kit and unzip it
  2. Change to the root of the Tutorial kit directory
  3. Open the file introductory_tutorial.py using your favorite text editor (I quite like Atom, but you are welcome to use anything you like)

##1. Importing required modules

For this tutorial, we will require three modules:

  1. numpy
  2. sklearn.naive_bayes
  3. text_adapter

Import these into your Python script.

import numpy as np  #too lazy to type numpy every time
import text_adapter
from sklearn.naive_bayes import BernoulliNB

In the last statement we choose to import only the class BernoulliNB, because we will not be needing the other sklearn.naive_bayes classes.

##2. Preprocessing the training data for our classifier

The training data (the HAM and SPAM emails) have been provided for you in the script.

spam_emails=["Hello send your password", "hello please click link", "click link",
"your password here", "send password"]
ham_emails=["hello reset your password", "password email", "warm hello" ]

Now, our goal is to process the data into a format accepted by the BernoulliNB class.

If we navigate to the [documentation for BernoulliNB](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html), we can see that the fit function accepts two arguments: a Numpy matrix in which each sample is represented by a row, and a Numpy array of class labels (let’s denote SPAM by 0 and HAM by 1).

  • Concatenate the SPAM and HAM emails into one list for easy processing

training_data=spam_emails+ham_emails
  • Create a one-dimensional Numpy array of class labels

We have to assign a class label (0 or 1) to each email in the training_data list we created above. An easy way to do this is to use a mixture of repetition and list concatenation. An example is shown below, but you are welcome to come up with your own.

training_labels=np.array([0]*len(spam_emails)+[1]*len(ham_emails))  #SPAM (0) first, then HAM (1), matching the order of training_data
  • Create a Numpy feature matrix

To save time, you can create the feature matrix using the create_scikit_matrix function in the text_adapter module.

Unless you used a different import statement, do not forget to tell Python that create_scikit_matrix lives in the text_adapter module. Do this by calling it as text_adapter.create_scikit_matrix.

training_data_matrix=text_adapter.create_scikit_matrix(training_data)
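The source of text_adapter is not shown in this post, so here is a rough sketch of what a binary feature matrix looks like. The helpers build_vocabulary and binary_feature_vector are hypothetical stand-ins written for illustration (the real module may tokenize or order words differently). Each email becomes one row, and each distinct word in the training set becomes one column. The parenthesized print form is used so that the snippet runs on both Python 2 and Python 3.

```python
# Hypothetical stand-ins for the text_adapter helpers; the real module may differ.
spam_emails = ["Hello send your password", "hello please click link", "click link",
               "your password here", "send password"]
ham_emails = ["hello reset your password", "password email", "warm hello"]
training_data = spam_emails + ham_emails

def build_vocabulary(emails):
    """Return the sorted list of distinct lowercase words in the emails."""
    words = set()
    for email in emails:
        words.update(email.lower().split())
    return sorted(words)

def binary_feature_vector(vocabulary, email):
    """Return 1 for each vocabulary word present in the email, else 0."""
    present = set(email.lower().split())
    return [1 if word in present else 0 for word in vocabulary]

vocabulary = build_vocabulary(training_data)
feature_rows = [binary_feature_vector(vocabulary, email) for email in training_data]

# One row per email, one column per vocabulary word.
print(len(feature_rows))     # 8 emails
print(len(feature_rows[0]))  # 11 distinct words
print(feature_rows[0])       # the row for "Hello send your password"
```

Wrapping feature_rows in np.array would give a matrix with one row per sample, which is the shape that fit expects.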

##3. Creating and training our classifier

Now comes the really fun part! So far, we have preprocessed our data and we have a Numpy matrix with our feature vectors and a Numpy array with class labels. We are ready to do some machine learning!

First, let’s use the Scikit-learn API to create an instance of the BernoulliNB class. For future use, we will store this object in a variable called classifier

classifier=BernoulliNB()

Next, train the classifier by using our training data and training class labels.

classifier.fit(training_data_matrix, training_labels)

Hooray!

Test your classifier on a new incoming email

test_email="hello please send password"

To compute the posterior probability that this email is spam, we have to first convert it to a feature vector. As a reminder, we are using the words that occur in the training data set as our features. If the word is present in the email, we give it a value of 1. Else the value is 0.

#compute the dictionary of words based on the training set we defined above
training_dictionary=text_adapter.dictionary_builder(training_data)
#predict_proba expects a 2D array (one row per sample), so we reshape our single feature vector into one row
test_feature_vector=np.array(text_adapter.binary_feature_vector_builder(training_dictionary,test_email)).reshape(1,-1)

In Scikit-learn, probability of a datapoint belonging to a certain class (in our case HAM or SPAM) can be calculated by calling the method predict_proba() and passing the feature vector as an argument.

print classifier.predict_proba(test_feature_vector)

If you execute the statement above, you will notice that it returns two probabilities. The first probability is for class 0 (SPAM) and the second for class 1 (HAM).
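To get a feel for where these two numbers come from, here is a hand-rolled sketch of Bernoulli naive Bayes in plain Python. It is an illustration only and not scikit-learn's implementation. The add-one smoothing is meant to mirror the default alpha=1.0 of BernoulliNB, and the lowercase whitespace tokenization is an assumption, so the exact values may differ from what the tutorial script prints.

```python
# Hand-rolled Bernoulli naive Bayes with add-one smoothing (illustration only).
# Class 0 = SPAM, class 1 = HAM, as in the tutorial.
spam_emails = ["Hello send your password", "hello please click link", "click link",
               "your password here", "send password"]
ham_emails = ["hello reset your password", "password email", "warm hello"]
test_email = "hello please send password"

def words_of(email):
    """Lowercase the email and return its set of words (assumed tokenization)."""
    return set(email.lower().split())

vocabulary = sorted(set().union(*[words_of(e) for e in spam_emails + ham_emails]))

def word_probabilities(emails):
    """Smoothed P(word present | class) for every vocabulary word."""
    total = len(emails)
    return {word: (sum(word in words_of(e) for e in emails) + 1.0) / (total + 2.0)
            for word in vocabulary}

def joint_probability(email, emails, prior):
    """prior * product over the vocabulary of P(word present or absent | class)."""
    present = words_of(email)
    probabilities = word_probabilities(emails)
    result = prior
    for word in vocabulary:
        p = probabilities[word]
        result *= p if word in present else (1.0 - p)
    return result

n_total = float(len(spam_emails) + len(ham_emails))
joint_spam = joint_probability(test_email, spam_emails, len(spam_emails) / n_total)
joint_ham = joint_probability(test_email, ham_emails, len(ham_emails) / n_total)

# Normalize so that the two posteriors sum to 1.
posterior_spam = joint_spam / (joint_spam + joint_ham)
posterior_ham = joint_ham / (joint_spam + joint_ham)
print([posterior_spam, posterior_ham])  # same order as predict_proba: class 0, class 1
```

In this sketch the SPAM posterior comes out larger than the HAM posterior, mostly because "send" and "please" never appear in the ham examples.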

A very quick and simple introduction to Python (part 2)

Version 1.0 (last updated on Nov 8th, 2014)

Welcome back to the second part of the quick Python overview. In this section, we will cover the basics of if-else statements, while and for loops and creating methods. Once again, all comments and questions are encouraged and very welcome. Please feel free to email me at camillamon[at]gmail.com.

##2.0 Control-flow in Python programs

If-else statements and for and while loops form the core of control-flow in Python programs. Let’s take a brief tour of these structures, starting with the if-else statement.

###2.1 If and else

Typically in a more complicated program, we have to take different actions depending on some previous result. For example, suppose we have a number and we want to check whether it is divisible by 5. One very simple way to do this is to use the if-statement.

The syntax of the if-statement in Python is as follows:

if test:
  do something
else:
  do something else

Here is a small example illustrating the concept:

random_number=10
if random_number%5==0:
  print "Divisible by 5"
else:
  print "Not divisible by 5"

Let’s step through the example above line by line. In the first line, we create the number object (‘10’) and give it the name random_number. Next, we want to find out whether the object that the name random_number points to is divisible by 5 or not. In order to do this, we employ the modulus operation, which gives us the remainder of random_number divided by 5. If the remainder is 0 (i.e. random_number is divisible by 5), then we will print out “Divisible by 5”. Otherwise, the else branch runs and prints “Not divisible by 5”.
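If the modulus operator is new to you, here is a tiny sketch of how it behaves for a few hand-picked numbers. The parenthesized print form is used here so that the snippet runs on both Python 2 and Python 3.

```python
# The % operator returns the remainder after division.
print(10 % 5)   # 0, so 10 is divisible by 5
print(12 % 5)   # 2, so 12 is not divisible by 5
print(9 % 3)    # 0, so 9 is divisible by 3

# Collect the remainders of several numbers divided by 5.
remainders = [n % 5 for n in [10, 12, 15, 21]]
print(remainders)  # [0, 2, 0, 1]
```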

If you want, run this program using the interactive Python shell or the online Python shell. Which one of the statements is printed to the terminal?

####2.1.1 A more complicated example

In the little program above, we only tested for one condition (whether or not the number we stored in random_number is divisible by 5). Usually in real programming life, testing for one condition is not enough for what we aim to achieve. So let’s take our little example one step further.

Suppose that we still want to check whether the number stored in random_number is divisible by 5. If it is not, we want to check whether it is divisible by 3. This means that our program has to branch into three different ‘logical’ branches.

  • random_number is divisible by 5
  • random_number is not divisible by 5 but is divisible by 3
  • random_number is neither divisible by 5 nor divisible by 3

In Python, this could be achieved in the following manner:

random_number=10 #or assign a number of your choice
if random_number%5==0:
  print "The number is divisible by 5!"
elif random_number%3==0:
  print "The number is divisible by 3!"
else:
  print "The number is divisible neither by 5 nor 3 "
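One detail worth seeing in action: the branches are tested from top to bottom, and only the first branch whose test is true runs. The sketch below wraps the same logic in a small function (describe_divisibility is a made-up name for illustration) so we can try several numbers at once.

```python
def describe_divisibility(number):
    """Return a short description using the same if/elif/else chain as above."""
    if number % 5 == 0:
        return "divisible by 5"
    elif number % 3 == 0:
        return "divisible by 3"
    else:
        return "divisible by neither 5 nor 3"

results = [describe_divisibility(n) for n in [10, 9, 7, 15]]
print(results)
```

Note that 15 is divisible by both 5 and 3, but it is reported as divisible by 5 because that test comes first.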

##Exercises:

  1. Find out if the year 1044 is a leap year. A year is a leap year if it is divisible by 4, unless it is also divisible by 100, in which case it must be divisible by 400 as well.

###2.2 The while loop

Now that we are familiar with if, else and elif statements, we can take a look at the while loop. The while loop executes while some condition is true and is especially useful if we want to execute a block of code repeatedly. Let’s illustrate this with a simple example. Suppose we want to print out all of the numbers from 1 to 10.

number=1   #the initial number
while number<11:  
  print number
  number+=1

As we can see from the example above, a while statement is declared with the following syntax

while test:
  do something

The loop will keep executing until the test becomes false. In the little number printing example, the test in the while statement evaluates whether the number object referenced by number is less than 11. If yes, the statements inside the while block are executed.

####A Word of Warning: Do not write infinite loops!

When writing your first while loops, it’s easy to forget to make sure that the loop terminates. What would happen if we leave out the statement number+=1 from the while loop we wrote above?
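If you would like to experiment with this safely, one option is to add a safety cap to the loop. The sketch below deliberately leaves out number+=1 and uses a hypothetical max_iterations counter (not part of the original example) to break out after a fixed number of rounds.

```python
number = 1
iterations = 0
max_iterations = 1000  # hypothetical safety cap, added only for this experiment

while number < 11:
    # number += 1 is deliberately left out, so the test never becomes false
    iterations += 1
    if iterations == max_iterations:
        print("Stopped by the safety cap; number is still " + str(number))
        break
```

Without the cap, the only way to stop this loop would be to interrupt the program (for example with Ctrl+C in a terminal).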

###Exercises:

  1. Write a small program that checks the numbers from 1 to 25 and prints only those that are divisible by 5

###2.3 For loops

The for loop is a close cousin of the while loop. It is designed to iterate (or step through) the items of, for example, a list or a string. Let’s look at the general syntax of the for loop.

#general syntax for a for-loop
for element in object:
  execute code here

The for loop begins with a header similar to that of the while loop. There is one key difference, though. The header for the for-loop also includes something called an assignment target, which we called element in the script above.

You can think of the assignment target as a box. When we loop through an iterable object such as a list, every element takes a turn jumping into the box. While the element ‘lives in the box’, we can carry out operations on it.

If all of this seems nebulous right now, do not worry! We will make all of this concrete by working through a for loop example.

We are given a list of elements (these may be strings or numbers or a mixture of both) and we have to print out each element.

random_elements=["apple", "Jack", 12, 1+8, "athlete"]

#let's print out each element using the for-loop
for element in random_elements:
  print element

"""
#The output should be :
apple
Jack
12
9
athlete
"""

It does not matter what we call the assignment target. Instead of element, we could have called it word or chocolatebar or even simply x.
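The same pattern works for strings, where each character takes a turn in the box. Here is a small sketch with a made-up word and the assignment target named letter; the parenthesized print runs on both Python 2 and Python 3.

```python
word = "hello"
letters = []

# Each character of the string is assigned to letter in turn.
for letter in word:
    letters.append(letter.upper())

print(letters)  # ['H', 'E', 'L', 'L', 'O']
```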

Now it’s time to write your own for-loop.

###Exercises:

  1. Searching for a string in a list: You are given the following list (copy and paste this into your Python shell or text editor)
strings=['absb', 'hello', 'hghgtjk', 'apples', 'icecream']

Use a for-loop to iterate through the list. When you encounter the string “apples”, print out the words “I found apples” on the console.

A very quick and simple introduction to Python (part 1)

Hello and welcome to part 1 of the quick Python overview! This is a very basic tutorial that will quickly allow you to learn enough Python to attend the Intro to ML with Scikit-learn workshop. For the purpose of this tutorial, you do not have to install Python. You can do all of the exercises in the Online Python shell.

If you run into any trouble or you find that a concept is wrong or poorly explained, please do not hesitate to contact me at camillamon[at]gmail.com. I’ll try to get back to you as soon as I can! There is also a list of alternative Python resources at the end of this tutorial.

##1. Python and Objects

One of the cornerstones of machine learning is (you guessed it!) manipulating data. In Python, data is manipulated using objects such as numbers, strings, lists and dictionaries. Some of these objects are built-in, others come from external libraries. You can also define your own objects.

###1.0 Examples of built-in Python Objects

#numbers
15
1234.9

#strings
"hello world"

#lists
my_numbers=[1,4,5,6]
nested_list=[1,[1,2]]
empty_list=[]

#tuples
simple_tuple=(9,10)

#dicts
my_empty_dictionary={}
my_nonempty_dictionary={'song':"Etude 9",
                        'duration':10}

###1.1 Assigning names to objects

Suppose we have a program that prints “hello world” several times. We could simply type “hello world” every time we want to print it.

print "hello world"
#do something else
print "hello world"
#do something else
print "hello world"

Instead of typing “hello world” every time, we can assign a name to “hello world”. In Python, assigning a name to an object is done using the “=” symbol. We can then rewrite the little program from above as follows

hello_string="hello world"
print hello_string
#do something else
print hello_string
#do something else
print hello_string

Giving an object a name and then referring to the object by a name will make it very easy for us to change the program. Suppose that instead of “hello world”, we want to print “hello everyone”. In the first version of this program, we would have to change every instance of “hello world” (tedious and you might miss some and break your program), but in the second version we have to change the string only once.

You can assign names to any object types (number, list, string etc.)

#basic examples with strings, numbers and dicts
title="Do Androids Dream of Electric Sheep"
year_of_publication=1968
my_favorite_books={'Donna Tartt':'The Goldfinch',
                    'Sylvia Plath':'The Bell Jar'}

####A Brief Note on Variable Names

It is good practice to give your variables descriptive names. This will save you from having a ‘WTF is this thing here’ moment when you come back to your code several weeks or months later. Python is quite relaxed about variable naming rules, but there are a few NO-NOs:

  1. Do not start a variable name with a digit
  2. Do not use one of Python’s reserved keywords as a variable name
  3. Do not use symbols such as ‘@’ in variable names
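The three rules above can be turned into a rough check. This is only a sketch: looks_like_valid_name is a made-up helper, and its pattern only allows ASCII letters, digits and underscores, which is stricter than what Python itself accepts. The keyword module is part of the standard library.

```python
import keyword
import re

def looks_like_valid_name(name):
    """Rough check: letters, digits, underscores; no leading digit; not a keyword."""
    matches_pattern = re.match(r"^[A-Za-z_][A-Za-z0-9_]*$", name) is not None
    return matches_pattern and not keyword.iskeyword(name)

candidates = ["year_of_publication", "2nd_title", "for", "e@mail"]
checks = {name: looks_like_valid_name(name) for name in candidates}
print(checks["year_of_publication"])  # True
print(checks["2nd_title"])            # False: starts with a digit
print(checks["for"])                  # False: reserved keyword
print(checks["e@mail"])               # False: contains '@'
```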

###Exercises:

  1. Create a string object with the name of your favorite novel and assign it to a variable with a descriptive name
  2. Create a list with some of your favorite numbers and give it the name my_favorite_numbers
  3. Create a dictionary using whatever keys and values you prefer

###1.3 String operations

In the machine learning workshop, we will manipulate textual data. Thus it is a good idea to go over some very basic string operations, which are built into Python.

#how do I create a string?
dna="ACGTGTCGTGTGTGTG"

#how do I find out the length of a string?
len(dna)
#16

#how do I obtain the first letter of a string (or nucleotide for the biologists among us!)
dna[0]   #note indexing starts at zero
#'A'

#how do I obtain the last letter of a string?
dna[-1]
#'G'

#or
dna[len(dna)-1]  #QUIZ: would the command dna[len(dna)] work? Why or why not?
#'G'

# how do I obtain a substring? (also known as slicing)
dna[0:3]
#'ACG'
#QUIZ: what does the operation dna[-2] return?
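Slices can also leave out the start or the end index. Here is a short sketch on a different, made-up string, so the quizzes above stay unspoiled.

```python
language = "python"

first_two = language[:2]   # start omitted: from the beginning up to (not including) index 2
rest = language[2:]        # end omitted: from index 2 to the end
middle = language[1:4]     # indices 1, 2 and 3

print(first_two)  # py
print(rest)       # thon
print(middle)     # yth
```

Putting the first two pieces back together with first_two + rest gives the original string again.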

###Exercises:

  1. Create a string in the Online Python interpreter (or your own interactive Python session) and give it a descriptive name (e.g. my_string for those of us lacking imagination :D)
  2. Obtain the first 4 characters of my_string
  3. Find out the length of my_string
  4. Create a string object with the value “hello world” by concatenating two string objects, “hello” and “world”.

###1.4 Basic list operations and methods

Lists allow us to store a group of related objects (strings, numbers etc) together.

#how do I create a list?
favorite_ice_cream=["vanilla", "chocolate", "strawberry"]
prices=[12, 56, 78]

#how do I access an element in the list?
favorite_ice_cream[0]
#'vanilla'

#how do I find out the number of elements in a list?
len(favorite_ice_cream)
#3

#how do I concatenate two lists?
favorite_ice_cream+prices
#['vanilla', 'chocolate', 'strawberry', 12, 56, 78]

# as we can see from above,
#Python allows you to have lists with objects of different types (ie. numbers and strings)

In addition to the basic operations illustrated above, lists come with several predefined methods.

#how do I add an element to the end of a list?
favorite_ice_cream.append("cherry garcia")

#how do I delete an item at position n (where n is the index of the element you want to delete)?
favorite_ice_cream.pop(0)
favorite_ice_cream
#['chocolate', 'strawberry', 'cherry garcia']
#the object 'vanilla' was deleted from the list
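A few more list methods are handy in practice. The sketch below uses a fresh list (flavors is a made-up name) and shows that pop hands back the element it removes, that remove deletes by value instead of by position, and that insert adds an element at a chosen index.

```python
flavors = ["vanilla", "chocolate", "strawberry"]

removed = flavors.pop(0)       # removes and returns the element at index 0
flavors.remove("strawberry")   # removes the first element equal to "strawberry"
flavors.insert(0, "mint")      # inserts "mint" at index 0

print(removed)  # vanilla
print(flavors)  # ['mint', 'chocolate']
```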

###Exercises:

  1. Create a list with your favorite desserts and give it a meaningful name
  2. Print the length of my_random_list given by the expression
my_random_list=range(1,10)

Find out more about the range() function by navigating to the Python docs page.

###1.5 Basic dictionary operations and methods

Sometimes we want to associate particular keys with values. For example, a company may want to store some basic information about its employees.

#how do I create a dict object?
employee={
          'name':'Jane Doe',
          'department':'engineering',
          'salary':300000,
          }
#how do I access the value associated with a key?
employee['name']
#'Jane Doe'

#how do I obtain a list of all keys?
employee.keys()
#['department', 'salary', 'name']

#how do I add another key-value pair?
employee['programming_language']="Python"


#check that the new key has been added by printing the keys
employee.keys()
#['department', 'salary', 'programming_language', 'name']

#note that the keys may be returned in a different order
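Two things often trip people up with dictionaries: asking for a key that does not exist, and relying on the order of the keys. Here is a small sketch of how to handle both, using the in operator, the get method with a default value, and sorted. The key 'office' is made up for the example.

```python
employee = {'name': 'Jane Doe', 'department': 'engineering', 'salary': 300000}

# Check whether a key exists before using it.
has_office = 'office' in employee

# get returns a default value instead of raising an error for a missing key.
office = employee.get('office', 'unknown')

# sorted gives the keys in a predictable (alphabetical) order.
ordered_keys = sorted(employee.keys())

print(has_office)    # False
print(office)        # unknown
print(ordered_keys)  # ['department', 'name', 'salary']
```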

###Exercises:

  1. Create a dict object for employee with the name “John Smith”
  2. Populate it with key-value pairs of your choice

##Other resources for learning Python