thoughtwisps One commit at a time

Hello and welcome to thoughtwisps! This is a personal collection of notes and thoughts on software engineering, machine learning and the technology industry and community. For my professional website, please see race-conditions. Thank you for visiting!

the software engineering notebook

Fellow software engineers/hackers/devs/code gardeners, do you keep a notebook (digital or plain dead-tree version) to record things you learn?

Since my days assembling glassware and synthesizing various chemicals in the organic chemistry lab, I’ve found keeping notes to be an indispensable tool for getting better and remembering important lessons learned. One of my professors recommended writing down, after every lab session, what had been accomplished and what needed to be done next time. When lab sessions are few and far between (weekly instead of daily), it is easy to forget the details (for example, the mistakes made while weighing out chemicals). A good quick summary helps with this!

When I first started working for a software company, I was overwhelmed. Academic software development turned out to be very different from large-scale distributed software development. For example, the academic software I wrote was rarely version controlled and had few tests. I had never heard of a ‘build’ or DEV/QA/PROD environments, not to mention tools like Gradle or Jenkins. The academic software I worked on was distributed in zip files and usually edited by only one person (usually the original author). The systems I started working on were simultaneously developed by tens of developers across the globe.

To deal with the newbie developer info-flood, I went back to the concept of a ‘software engineering lab notebook’. At first, I jotted down the commands needed to set up proper compilation flags for the dev environment and how to run the build locally to debug errors. A bit later, I started jotting down diagrams of the internals of the systems I was working on and summaries of code snippets that I had found particularly thorny to understand. Sometimes these notes proved indispensable in high-pressure debugging scenarios, when I needed to quickly revisit what was happening in a particular area of the codebase without the luxury of a long debugging session.

In addition to keeping a record of things that can make your development and debugging life easier, a software engineering lab notebook can serve as a good way to learn from previous mistakes. When I revisit some of the code I wrote a year ago, or even a few months ago, I often cringe. It’s the same feeling as when you read a draft of a hastily written essay or work of fiction and then approach it again with fresh eyes. All of the great ideas suddenly seem - well - less than great. For example, I was recently looking at a server-side process I wrote to perform computations on a stream of events (coming via a ZeroMQ connection from another server) and saw that, for some reason, I had included logging that looped through every single item in an update (potentially hundreds) and wrote a log statement with the data! Had the rate of events been higher, this could have caused performance issues, though quantifying the exact impact is still an area where I need to improve. Things like these go into the notebook’s ‘avoid-in-the-future’ list.
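To make the pattern concrete, here is a minimal, hypothetical sketch of the kind of per-item logging I mean, next to the cheaper aggregate alternative (the function and variable names are invented for illustration, not taken from the actual process):

import logging

logger = logging.getLogger(__name__)

def handle_update_noisy(update):
    # anti-pattern: one log statement per item in a potentially large update
    for item in update:
        logger.info("processing item: %s", item)

def handle_update_quiet(update):
    # cheaper alternative: a single summary log line per update
    logger.info("processing update with %d items", len(update))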

decision boundary poisoning - a black box attack on a linear SVM

Introduction

If you regularly browse machine learning websites, you may have seen the image of a self-driving car baffled by a circle of salt drawn on the ground. This ‘hack’ on the car’s sensing devices shows that there is still some work to do to make sure that machine learning algorithms are robust to malicious (or accidental) data manipulation.

Sarah Jamie Lewis’ post on adversarial machine learning is a great introduction to and bibliography of the topic. One of the papers the article links to is ‘Can Machine Learning be Secure?’ by Barreno et al.

Barreno et al describe various ways to detect attacks and among them talk about examining points near the decision boundary. A large cluster of points around the boundary might indicate that an exploratory attack is taking place.
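As a rough sketch of what such a check might look like with scikit-learn (the helper and the threshold value below are my own illustration, not something from the paper), one could flag the samples whose decision function values are close to zero:

import numpy as np

def points_near_boundary(clf, X, threshold=0.5):
    # clf is assumed to be an already fitted linear classifier, e.g. LinearSVC;
    # samples with a small absolute decision function value lie close to the boundary
    scores = clf.decision_function(X)
    return X[np.abs(scores) < threshold]

# an unusually large cluster returned here might hint at an exploratory attack:
# suspects = points_near_boundary(trained_clf, incoming_batch)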

I wanted to take this idea of points at the decision boundary and explore how one could force a Linear Support Vector Machine classifier trained on the famous Iris dataset to misclassify a rogue point.

Attack description

This attack rests on premises that make it largely unrealistic. For example, the attacker in this case has full knowledge of the dataset, can visualise the decision boundary and can force the classifier to retrain at will. This will most likely never happen in the real world. It is, however, a black box attack, meaning we assume that the attacker does not know anything about the internals of the Linear SVM classifier or its training process.

  1. Begin with a classifier trained on the Iris dataset to distinguish between the Iris setosa and Iris versicolor species
  2. The attacking class will be Iris versicolor. We will inject a rogue point into the Iris setosa dataset and then poison the training data until this rogue point is classified as Iris versicolor.

Prepare the training data

import sklearn
import pandas as pd
import matplotlib
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', header=None)
df.head()
0 1 2 3 4
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
# prepare the class labels: Iris setosa will be labelled as -1, Iris versicolor as 1
species = df.iloc[0:100, 4].values
y = np.where(species == "Iris-setosa", -1, 1)
# prepare the training data: sepal length (column 0) and petal length (column 2)
data = df.iloc[0:100, [0, 2]].values

Visualise training data

As we can see from the image below, the data points (whose locations are based on the petal length and sepal length of the two species of Iris ) are linearly separable - so we should be able to learn a good decision boundary with the Linear SVM.

plt.scatter(data[:50, 0], data[:50, 1], color='red', marker='o', label='setosa')
plt.scatter(data[50:100, 0], data[50:100, 1], color='blue', marker='x', label='versicolor')
plt.xlabel('sepal length')
plt.ylabel('petal length')
plt.legend(loc='upper left')
plt.show()

[figure: scatter plot of the setosa (red) and versicolor (blue) training data]

Attack tools

To monitor the progress of the attack, we will define a helper function below. The function plots the decision regions learned by the classifier as well as the data. I implemented the function based on the exposition in Python: Deeper Insights into Machine Learning by John Hearty, David Julian and Sebastian Raschka.

def plot_decision_regions(data, y, classifier, resolution=0.02):
    """
    Plot the decision regions of a fitted classifier and the data points,
    based on "Implementing a perceptron algorithm in Python" by Raschka et al.
    """
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])

    # build a meshgrid over the data range and classify every grid point
    x1min, x1max = data[:, 0].min() - 1, data[:, 0].max() + 1
    x2min, x2max = data[:, 1].min() - 1, data[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1min, x1max, resolution), np.arange(x2min, x2max, resolution))
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1, xx2, Z, alpha=0.4)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())
    # plot the training data, one marker and colour per class
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=data[y==cl, 0], y=data[y==cl, 1],
                    alpha=0.8, c=cmap(idx),
                    marker=markers[idx], label=cl)
    # plot the rogue point we want the classifier to misclassify
    plt.scatter(x=[6.0], y=[1.8],
                alpha=0.8, marker='v', c='cyan')

Linear SVM classification

We will be using the Linear SVM implementation from scikit-learn.

from sklearn import svm

clf = svm.LinearSVC()
clf.fit(data, y)

Plotting the decision surface before the attack

This is the situation before we begin poisoning the decision boundary. I have added the rogue point, in light blue/cyan, into the red class region at (6.0, 1.8). The aim will be to move the decision boundary so that this point is misclassified as the blue class.

plot_decision_regions(data, y, clf)
plt.legend(loc='upper left')
plt.show()

[figure: decision regions before the attack, with the rogue point shown in cyan at (6.0, 1.8)]

Let’s define a few helper functions: add_attack_point will merge new poisoned training data with the existing training dataset, retrain_and_plot will rerun the classifier and plot the resulting decision boundaries.

def add_attack_point(data, y, attack_points, attack_class=1):
    """
    Add a new attack point to the dataset
    
    Returns:
    --------
    
    New dataset including attack point
    
    New class label vector including label for attacking class (1 in this case)
    """
    return np.concatenate((data, attack_points)), np.concatenate((y, np.array(len(attack_points)*[attack_class])))
def retrain_and_plot(clf, new_data, new_y):
    """
    Retrain the classifier with new data and plot the result
    """
    clf.fit(new_data, new_y)
    plot_decision_regions(new_data, new_y, clf)
    plt.legend(loc='upper left')
    plt.show()

Attack 1: Add a single data point and see what happens

new_data, new_labels = add_attack_point(data, y, [[6.0, 2.0]])
retrain_and_plot(clf, new_data, new_labels)

[figure: decision regions after adding a single attack point]

One of the instances from the blue class now falls within the decision region of the red class, but this has not caused a noticeable shift in the position of the decision boundary. Let’s continue adding more points close to the rogue point we wish to re-classify as blue.

Attack 2: More poisoned datapoints

x1_new = [6.0, 6.1, 6.05, 6.08, 6.09]
x2_new = [1.5, 1.4, 1.55, 1.8, 1.75]
# materialise the list so it can be concatenated (map returns an iterator in Python 3)
new_vals = list(map(list, zip(x1_new, x2_new)))
new_data_2, new_labels_2 = add_attack_point(new_data, new_labels, new_vals)
retrain_and_plot(clf, new_data_2, new_labels_2)

[figure: decision regions after the second round of attack points]

The decision boundary is clearly shifting towards the mass of new blue attack points.

Attack 3

x1_new = [5.5, 5.6, 5.559, 5.7, 5.45, 6.0, 6.1, 6.2, 5.9]
x2_new = [2.1, 2.105, 2.110, 2.089, 2.0, 1.8, 1.9, 2.0, 1.9]
new_vals = list(map(list, zip(x1_new, x2_new)))
new_data_3, new_labels_3 = add_attack_point(new_data_2, new_labels_2, new_vals)
retrain_and_plot(clf, new_data_3, new_labels_3)

[figure: decision regions after the third round of attack points; the rogue point now falls in the blue region]

The rogue point is now classified as class 1.

rogue_point = [[6.0, 1.8]]
clf.predict(rogue_point)
array([1])
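As an extra sanity check (not part of the original experiment), we can also look at the signed decision function value of the rogue point; with the labelling used here, a positive score means the point now sits on the versicolor (class 1) side of the poisoned boundary:

# signed distance-like score of the rogue point under the poisoned classifier
print(clf.decision_function(rogue_point))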

Summary

This was a brief and very incomplete experiment with changing the decision boundaries of Linear SVM classifiers by adding poisoned training data close to the decision boundary. One of the major shortcomings of this exposition is that the points were added in a largely random manner around the rogue point, without any systematic attack strategy. In the next few essays on this subject I hope to present a white box approach to poisoning SVMs (attacks where the attacker has intricate knowledge of how support vector machines work) as well as more systematic data poisoning approaches.

notes on container security I

This post is a set of notes from Jess Frazelle’s talk ‘Benefits of isolation provided by containers’, delivered at the O’Reilly Security 2016 Conference in New York. The notes are abbreviated in places, and where I was not familiar with concepts from the talk, I’ve added some of my own clarifications/definitions. All mistakes are mine.

How do containers help security?

  • they do not prevent application compromise, but can limit the damage

  • the world an attacker sees inside a container is very different from what she would see looking at an app running without a container

What is a container?

  • a group of Linux namespaces and control groups applied to a process

What are control groups (cgroups) ?

Cgroups limit what a process can use. There are various kinds of cgroups: for example, the memory cgroup limits how much physical or kernel memory a process can use, and the blkio cgroup limits various block I/O operations. Cgroups are controlled through files.
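For example, on a cgroup v1 host you can read the memory limit of the root memory cgroup straight from the filesystem (a minimal sketch; the exact path differs on cgroup v2 systems, where the equivalent file is memory.max):

from pathlib import Path

# cgroup v1 layout; on cgroup v2 the equivalent file is .../memory.max
limit_file = Path("/sys/fs/cgroup/memory/memory.limit_in_bytes")
if limit_file.exists():
    print("memory limit:", limit_file.read_text().strip())
else:
    print("memory cgroup file not found (different cgroup version or layout?)")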

What are namespaces ?

Namespaces limit what a process can see. They are exposed as files located in /proc/{pid}/ns.
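A quick way to see this on a Linux machine is to list the namespace links of the current process (a small sketch of my own, not from the talk):

import os

# each entry in /proc/<pid>/ns is a symlink such as net -> net:[4026531992]
ns_dir = "/proc/self/ns"
for entry in sorted(os.listdir(ns_dir)):
    print(entry, "->", os.readlink(os.path.join(ns_dir, entry)))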

Docker and defaults

  • by default, Docker blocks writing to /proc/{num} and /proc/sys, as well as mounting and writing to /sys

  • LSM (Linux Security Modules) is a framework that allows the Linux kernel to support a variety of security modules such as AppArmor and SELinux. Docker supports LSMs.

  • people generally don’t want to write custom AppArmor profiles; the syntax is not great (note to self: look up how to write AppArmor profiles)

Docker and seccomp

  • seccomp is a Linux kernel security facility that restricts the system calls a process can make

  • Docker has a default seccomp whitelist of allowed system calls, which prevents, for example, cloning a user namespace inside a Docker container (note to self: look up why this is a weak spot for Linux kernel vulns)
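To illustrate (my own sketch, not from the talk): calling unshare with CLONE_NEWUSER via libc is expected to fail with EPERM under Docker’s default seccomp profile, while on a plain host it will usually succeed for an unprivileged user (kernel configuration permitting):

import ctypes
import errno

CLONE_NEWUSER = 0x10000000
libc = ctypes.CDLL(None, use_errno=True)

# expected to fail with EPERM inside a container running the default profile
if libc.unshare(CLONE_NEWUSER) != 0:
    err = ctypes.get_errno()
    print("unshare(CLONE_NEWUSER) failed:", errno.errorcode.get(err, err))
else:
    print("unshare(CLONE_NEWUSER) succeeded")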

Docker security in the future

  • one sad thing: containers need to run as root
  • why: to write to /sys/fs/cgroup for cgroups, we need to be root

  • cgroups cannot be created without CAP_SYS_ADMIN

  • cgroup namespaces: a potential solution to having to be root when creating a cgroup, but it turns out this concept just limits what cgroups you can see

  • there are patches in the kernel to make it possible for other users to create cgroups, but they have not landed yet

disconnect

Trying not to fall into a rabbit hole is harder than it should be.

Neurons, cut off from the constant stream of novel fast-food information, are struggling to adjust to a life outside of the internet noosphere.

But the brain is a pliable creature. Slowly, it rewires itself to assign neurons performing information grazing to other tasks, one new connection at a time. It does not feel pleasant and the desire to connect back to the buzz of thousands of internet voices is always present.

The info-FOMO (the fear that the biggest and most important thing will happen on the interwebs while you’re away) is always there, but I can learn to live with it until it fades. Also, I won’t be able to look up Game of Thrones spoilers, so someone will have to tell me what ultimately happens between Daenerys and Jon Snow.

I have not pulled a complete Aziz Ansari (the GQ Fall 2017 interview with the comedian reports that he has completely disconnected and removed the browsers from his laptop and phone) yet, but I’ve updated /etc/hosts to have even more entries and am considering permanently switching to lynx. It’s a text-based browser and thus will hopefully encourage swift and focused information lookups instead of mindless surfing.

seven questions about technology

When you visit the tallest floor of a London skyscraper, you realise that perspective matters. From the 39th floor, London appears to be nothing more than a whimsically assembled menagerie of various shapes clustered on the banks of a single ribbon of blue, the Thames, bending around the tongue-like Isle of Dogs on its way to the City.

I sometimes wonder what it would feel like to look at time in this way, with perspective and distance.

It sometimes seems infuriating that time is a dimension that cannot be examined in both directions. We can only look back and hypothesize about the future, but there is no skyscraper we can climb which will show us the whole view.

The future humans will look at us, our buildings and customs and cultures, and wonder why we made the mistakes we did. Perhaps they will try to walk a bit in our shoes to see why our choices appeared obvious.

They will look at our tablets and smart phones the way we look at floppy disks, VHS tapes and cassettes, with a wry smile and maybe an eye-roll (full disclosure: I am a child of the floppy-disk age, but even I rolled my eyes at cassettes).

Why should we let posterity have all the fun? Even though every generation is, in a way, blind to the shortcomings and dangers of the technology du jour, that should not stop us from taking a critical look at what is happening.

Neil Postman, an influential voice on this topic, was a cultural critic, professor, and historian of technology and media. In a March 1997 speech, “The Surrender of Culture to Technology” (available on YouTube), Postman outlined seven questions that we can use to evaluate a new technology.

Before looking at the questions, we should make note of an important distinction between the words media and technology. A technology is to a medium what a brain is to a mind, Postman says. A technology is a machine; a medium is a social creation. How a technology is used by a culture is not necessarily the only way it could be used.

The seven questions about technology:

  1. What is the problem to which this technology is a solution? There are technologies that are not solutions to any problem.

  2. Whose problem is this? Most technologies do solve a problem. Who will benefit from this technology and who will pay for it? These are sometimes not the same people.

  3. What new problems might be created because we have solved the old problem? Technologies generate new problems, but sometimes it is hard to know what the new problems will be. For example, Postman argued that television, while allowing for mass communication and mass entertainment, had permanently changed the nature of political discourse. It would have been fascinating to hear his take on Twitter in November 2016.

  4. Which people and what institutions might be most seriously harmed by this technology?

  5. What changes in language are being enforced by new technologies? What is lost and what is gained?

  6. What sort of institutions acquire special economic and political power because of technological change?

  7. What alternative uses might be made of a technology? What alternative media might arise from this technology?

Postman argued that it was not inevitable that television (the physical technology) became the commercial television we all know today. He cited examples of countries where (in the 90s) television was not subject to any commercial interests. The crux of the argument is that the medium (the current world wide web, for example) that grows out of a particular technology (the physical internet network) is not the only possible medium we could have created. How a particular technology is transformed into a particular medium is a complex process involving society, politics and greed.