race_conditions Coding and running

Data science with Python and RxPy - First steps and notes

The world is full of data that changes or 'ticks' over time - prices of financial instruments on the market, the quality of air over a given day, the number of Twitter followers a particular user has (I'm sure you can come up with many more!). How do we best capture and analyse this data? One way of engineering data systems around constantly changing streams of data is to design reactive event-driven systems. That is a mouthful that probably has as many definitions as there are practitioners, so it might be best to examine what components are needed to construct such a system.

To build the basics of a reactive system we need Observers (objects that perform some action - 'react' - when a piece of data they are interested in ticks) and Observables, which represent the streams of data. In essence, we could characterise this as a publish-subscribe system: the Observables publish streams of data to all interested Observers.

Let's make all of this concrete by implementing a simple example using the RxPy library and Python 3. Suppose the air quality of a specific area is measured using an index that can take values from 1 to 10. Let's design an Observer that subscribes to this stream of mocked air quality data and emits warnings based on the value of the index. The Observer that we need to write should inherit from the RxPy library's Observer class and implement three methods:

  • on_next which is triggered when an event from an Observable is emitted

  • on_completed which is called when an Observable has exhausted its stream of data (there are no more events)

  • on_error which is triggered when something goes wrong

from rx import Observer, Observable
from numpy import random

class DataAnalyser(Observer):
    # A class for analysing air pollution data ticks
    def on_next(self, value):
        if value <= 3:
            print('Safe to enjoy the outdoors!')
        else:
            print('Air pollution is high in your area - please take care!')

    def on_completed(self):
        print('Finished analyzing pollution data')

    def on_error(self, error):
        print('Something went wrong in data analysis:', error)

To complete this example, we also need an Observable (our stream of mock air quality data). We can create one very easily using the RxPy Observable class. Finally, we call the Observable's subscribe method to register DataAnalyser as the object interested in the stream of data published by the Observable.

def main():
    # numpy's randint excludes the upper bound, so use 11 to allow indices 1-10
    air_pollution_indices = Observable.from_([random.randint(1, 11) for _ in range(20)])
    data_analyser = DataAnalyser()
    air_pollution_indices.subscribe(data_analyser)

if __name__ == '__main__':
    main()

The full sample script is available as a GitHub Gist.

Legacy software and disadvantages of highly specialised teams

when making a small adjustment becomes an un-testable multi-team problem

Today I'd like to talk about some frustrations that arise when working on a legacy system developed by multiple remote software development teams.

A long time ago, I worked on a system that snapped some realtime ticking data, performed a few computationally expensive calculations (they had to be run on a remote server machine) and sent the result to a user's front end. I was placed in charge of building out the infrastructure for client-server communication. The data manipulation libraries that calculated values based on data ticks were developed by an independent team, and I was not given access to modify this code. Although this system had many shortcomings (most introduced by me), a particular pain point was the system of databases and APIs that had grown around the service that supplied ticking data. At the lowest level of the system was a message queue, which monitored the various tick data sources. The data on the queue was pushed into a database, which exposed a direct query API to any clients. However, the query language was more or less intractable to people without deep experience with the database software, and thus yet another team was set up to develop an abstraction on top of the database API. None of the APIs was properly documented, and all of them had been designed a few years before I came along.

Thus, the correct functioning of my software depended on a 3rd party calculation library, a database managed by a remote team, a hard-to-use database API managed by the same remote team and another abstraction built on top of the database API to 'make things easier' for the ultimate client applications (such as the system I developed). At first the components of the system cooperated fairly reliably, and eventually thousands of different datatypes were onboarded to the same system. Then the vendor software that governed the values being placed onto the message queue was replaced with another product that did not have the same filtration capabilities. Very soon I started seeing invalid data values propagating into my server-side calculation process, and the 3rd party calculation library struggled to cope with them. It did not have proper error handling to exclude bad data - since the need had never come up, no one had ever thought of developing it. This library was mission critical and used by hundreds of applications. Making any changes to it would have required a testing process spanning months and potentially tens of different teams. To make matters worse, I had no direct control over the data values, since the abstraction API on top of the database communicated directly with the 3rd party calculation library without returning any results to my server process.

My next step was to talk to the abstraction API team, but the solutions offered for filtering offending data could not be implemented, because such a change would have to be carried out for all datatypes and not just the one that was causing issues for me. The API had not been designed to provide granularity based on particular data types. In addition, it was hard to convince the abstraction API team that my problem was legitimate. The value ticking on the message bus was technically valid, but in the context of the business it made very little sense, which is why the 3rd party calculation API had never been written to expect it.

Eventually, after multiple discussions with multiple teams, the fix had to be implemented at the lowest level - the database API.

I think there are a few important lessons and some questions:

1) Segregating software developers into highly specialised teams produces software quickly, but the APIs delivered by such teams can easily ignore the needs of developers working on client libraries.

2) This is a hard one: software should be designed so that it can easily be extended at a future time when requirements change. This fact alone makes comparisons between civil engineering and software engineering difficult. I'd imagine that once a team of engineers decides to build a pedestrian bridge, they build a pedestrian bridge. No one will come along and say, "Hey, now your pedestrian bridge will also have to accommodate large trucks." This happens very often in software engineering when one tries to scale an application - the infrastructure that was able to support 100 users simply won't be able to cope with 1,000,000 users.

3) What are the best practices for designing data delivery layers? What features should be a part of APIs that expose realtime data to application developers?

This week (week 2)

It's the second week of the new year (2017, can you believe it? I know I'll still be writing 2016 on various paperwork for a while until muscle memory adjusts) and it's the time when my new year's resolutions are starting to crumble. But fear not - one learns from failure more than from (almost) anything else!

Software engineering

I'm doing a deep dive into Python's super. I realised last year that peppering the codebase with mixins and then calling super 'like I super-certainly know what it does' was not a good idea. First, in large codebases and huge distributed teams, people tend to copy a working example without actually making sure the original author had a clue. Second, I would really like to get to grips with the background of the MRO algorithm (and maybe check out other competing implementations).
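To make the deep dive concrete, here is a minimal sketch (the class names are my own invention) of cooperative multiple inheritance, where super() follows the C3-linearised MRO rather than simply calling the parent class:

```python
class Base:
    def greet(self):
        print('Base')

class Left(Base):
    def greet(self):
        print('Left')
        super().greet()  # delegates to the next class in the MRO, not necessarily Base

class Right(Base):
    def greet(self):
        print('Right')
        super().greet()

class Child(Left, Right):
    def greet(self):
        print('Child')
        super().greet()

# The MRO determines the delegation order for super():
print([cls.__name__ for cls in Child.__mro__])
# ['Child', 'Left', 'Right', 'Base', 'object']
Child().greet()  # prints Child, Left, Right, Base
```

Note how Left's super().greet() ends up calling Right.greet - exactly the behaviour that copy-pasted mixin code tends to get wrong.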

In other news, I finally have jekyll build working on my laptop, so I can check out the blog before pushing it online!

Running
This was not a great week for running, since I mostly gave in to the various excuses a tired post-work me came up with, but on Saturday I finally ventured out to explore Regent's Canal. Although the tow path is slightly harder than the trail-like tow path of the Grand Union Canal, it was still a pleasant run and a lot less muddy than my last weekend's excursion to Wimbledon Common. The weather was stellar as well and I saw a few people with a house boat using the canal locks!

Goals for next week

  1. Continue to reclaim my attention. Disconnect from pointless time-wasting social media sites and focus on focusing.
  2. Sleep enough
  3. Write a simple talk on socket programming in Python
  4. Read through Raymond Hettinger's blog posts on super

Code Mesh Day 2

Notes on Day 2 of the Code Mesh 2016 conference in London.

Stateful stream processing

With the main keynote moved to the afternoon, Day 2 of Code Mesh 2016 launched directly into the sessions. In the morning, I attended 'Streaming, Database & Distributed Systems: Bridging the Divide' by Ben Stopford from Confluent. It seems that distributed systems are the hot computer science topic du jour and I, not wanting to be left out of the cool kids crowd, headed over to listen and learn from Ben's expertise in streaming and distributed systems. Ben outlined two goals for the talk: understand stateful stream processing (SSP) and argue that SSP can be a 'general framework for building data-centric systems'. There are several flavours of data-analytics systems:

  • database (eg Postgres - provides a consistent view of the data)
  • analytics database (eg Hadoop, Spark - specialises in aggregations performed over large datasets)
  • messaging system (has ephemeral state)
  • stream processing (manipulates concurrent streams of events and performs computations on the streamed data)
  • stateful stream processing (a branch of stream processing)

If a database's query engine traverses a finite data table, the stream processor's query engine is designed to operate over an infinite dataset that, however, has to be bounded by a 'window'. The 'window' delineates how many 'ticks' of data are allowed into the stream processor's query engine. The query engine then executes a continuous query and emits data at some frequency. This behaviour is somewhat analogous to a materialised view in a database: the DB takes some tables and a query from the user (some aggregation or grouping of the data) and manifests the result as another table on disk. Materialised views are useful when performance is key (every time a user runs the query, the computation does not have to be repeated). The materialised view table is recalculated every time data in any of the underlying tables changes. How does this apply to stateful stream processing? In essence, stateful stream processing is about creating materialised views, which manifest as another table or another stream. SSPs typically use Kafka, a distributed log, to achieve statefulness (I am not entirely sure I understand what exactly Kafka does - I will be doing another post later to clarify some details).
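As a toy illustration of the 'window' idea (plain Python, not a real stream processor - the function name and window size are my own), here is a continuous query that maintains a running aggregate over the last few ticks:

```python
from collections import deque

def windowed_average(ticks, window_size=3):
    """Continuously emit the average over the last `window_size` ticks -
    a crude stand-in for a stream processor's windowed query."""
    window = deque(maxlen=window_size)  # the 'window' bounding the infinite stream
    for tick in ticks:
        window.append(tick)
        # the continuously maintained 'materialised view' of the stream
        yield sum(window) / len(window)

stream = [10, 20, 30, 40, 50]
print(list(windowed_average(stream)))
# [10.0, 15.0, 20.0, 30.0, 40.0]
```

A real SSP would keep this running aggregate durable (e.g. backed by a log such as Kafka) instead of in a Python deque, which is what makes it 'stateful'.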

New trends in web dev and the effect of machine learning on software engineering

After Ben's talk, it was time to dive into some emerging technologies for web development. Laure Phillips spoke on 'How web programming is more than a server and some clients?' She spoke about the traditional approach of segregating a rich internet application into different tiers (i.e. backend, frontend, database layer) and programming each layer with a bespoke language. Then she gave some insights into her own research: developing tierless programming frameworks for internet applications.

After Laure's talk on web dev, it was time to examine how the pervasive presence of machine learning in modern data systems is affecting the field of software engineering. Twitter's Kovas Boguta spoke about "Machine Learning Models: A New Kind of Software Artifact". Kovas' talk highlighted the challenges that traditional software engineering tools face when confronted with non-deterministic models (how do you write tests for a machine learning model? how do you test it sufficiently?).

Flexible Paxos

This talk by Cambridge PhD student Heidi Howard was on my most-anticipated list - primarily because it provides a new angle on the work done by Leslie Lamport. Unfortunately, lots of procrastination on the night before the conference meant that I had failed to read Lamport's original Paxos paper and thus was more or less lost during the talk. I plan to publish another blog post soon with an in-depth overview of Lamport's original Paxos paper and Heidi Howard's Flexible Paxos modification.

Code Mesh 2016

Code Mesh 2016: Part 1

Thanks to the generous support of the Code Mesh scholarship program, I was able to attend the conference, meet interesting people working on non-mainstream technologies and, most importantly, add about 100 new entries to my "Learn More About This" list. What follows is a summary I've constructed to remind myself of the talks and provide a convenient reference for future technical research projects.

Code Mesh 2016 Day 1

The first day of the conference kicked off with a keynote by Conor McBride on the topic of Space Monads. My brain - a newcomer to the world of functional programming - was catapulted into the unfamiliar terrain of functors, monads, a (syntactically) peculiar programming language called Agda and something that eventually compiled to show two moving squares in a console window. What I did (sort of) understand is that Agda allows the programmer to specify the type of a program and then automatically generates a program of that type. I am definitely adding space monads to my "Learn More About This" list, but I feel that I can only gain a solid understanding of the things demonstrated in the keynote after gaining some ground in the functor-monoid-monad territory (Haskell to the rescue?).

After the keynote, the conference attendees separated into three tracks. I attended a talk on 'Gossiping Unikernels' by Andreas Garnaes (Zendesk). Andreas discussed rearchitecting a monolithic application (everything lives in the same process) into microservices hosted on unikernels. While the traditional 'server-side' stack involves multiple layers of virtualisation (the application space, the container space, the virtual OS, the hypervisor), unikernels are able to interface directly with the hypervisor. Because unikernels incorporate less code than traditional operating systems, they have a reduced attack surface and are thus (potentially) more secure. The 'potentially' in this paraphrasing of Andreas' talk is my addition - mainly as a footnote, because I am not entirely clear on what the drawbacks of dropping traditional operating system code are. Do unikernels lose some traditional OS security by including less library code?

Furthermore, Andreas discussed a problem that emerged once the monolith had been broken into microservices: service discovery - in other words, how do other microservices know who is providing what? Two solutions are available: introduce a centralised messaging component or investigate peer-to-peer solutions. The naive peer-to-peer implementation would be to regularly heartbeat to all members of the microservices cluster, but this approach would take too much bandwidth, since the number of pings needed to assert membership grows quadratically with the number of microservice nodes. Thus there is a need for a distributed membership protocol that eventually detects a faulty node, is fairly fast and does not make high demands on the network. SWIM is such a protocol; it achieves its aims by randomly heartbeating and gossiping. For example, in a three-node system with Alice, Bob and Charlie microservices, Alice pings Charlie to check for failure. If Charlie responds, all is good. If Charlie does not respond within a particular timeframe, Alice pings Bob and asks Bob to ping Charlie. If Charlie responds to Bob, Bob relays this response to Alice. If not, Alice records Charlie as unavailable and propagates this information to Bob (gossiping), who is then aware of Charlie being unavailable and in turn propagates this information to all other nodes in the system. Thus all nodes eventually recognise that Charlie is no longer available.
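The Alice/Bob/Charlie exchange can be sketched as a toy simulation (my own simplification - real SWIM adds randomised probe targets, timeouts, incarnation numbers and piggybacked gossip):

```python
class Node:
    def __init__(self, name, alive=True):
        self.name = name
        self.alive = alive
        self.suspected = set()  # names this node believes are unavailable

    def ping(self, target):
        """Direct probe: succeeds only if the target is up."""
        return target.alive

    def ping_req(self, helper, target):
        """Ask a helper node to probe the target on our behalf."""
        return helper.ping(target)

    def probe(self, target, helpers):
        # Try a direct ping first, then indirect pings via helpers.
        if self.ping(target):
            return True
        if any(self.ping_req(h, target) for h in helpers):
            return True
        # No one could reach the target: mark it unavailable and
        # gossip the news to the helpers.
        self.suspected.add(target.name)
        for h in helpers:
            h.suspected.add(target.name)
        return False

alice, bob, charlie = Node('Alice'), Node('Bob'), Node('Charlie', alive=False)
alice.probe(charlie, helpers=[bob])
print(alice.suspected, bob.suspected)  # both now know Charlie is down
```

In the real protocol the probe targets are chosen at random each round and the gossip piggybacks on the ping messages themselves, which is what keeps the network load low.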

After describing the theoretical side of SWIM, Andreas gave a demo of a pure implementation of the protocol (once again dipping toes into functional territory) and demonstrated how pure implementations can make the development process easier by allowing different types of IO interfaces (sync, async, mock, MirageOS) to wrap the core SWIM implementation on demand.
Finally, the talk concluded with a useful tip on how to apply property-based testing to reduce the number of execution paths that have to be tested in a distributed system.

After Andreas' talk, I wandered into the main hall to hear Sophie Wilson speak about the 'Future of Microprocessors'. As someone who spends a big part of her day coming up with instructions for a silicon brain to execute, I am ashamed to say that the inner workings of that silicon brain are a bit of a mystery. In spite of my inability to coherently define key terms such as microprocessor or transistor, I had previously heard of Moore's Law ("the number of transistors we can fit on a silicon (chip?)" - a lapse in the note taking process - "doubles every 2 years"). The law is more of an empirical observation, and the precise wording apparently drifts depending on the chip industry. Sophie showed the original microprocessor designs for the 6502 and the ARM1. All I could do was marvel at the intricate photos and make mental notes to learn more about microprocessors ASAP. In her discussion of FirePath, Sophie demonstrated a set of instructions designed to be truly parallel and remarked that very few modern software development languages are truly parallel and thus are a poor fit for the parallel hardware environment (a theme that was touched upon in Kevin Hammond's ParaFormance talk later in the afternoon).

The number of transistors on a chip is rising, but the ability of devices to utilise this increasing capacity is limited by the cost of etching smaller gates and by the immense heat generated by these power-dense microprocessors. Thus even if more and more cores are added to future devices, some of these cores will have to be kept dark to keep the device from overheating. The 10x performance increases witnessed in the early development of microprocessors will not be seen again.

In the afternoon, Mark Priestley spoke about the process behind the birth of new programming paradigms - with a specific slant toward functional programming. Mark pointed out two ways to study the 'history of a programming paradigm': a backward-looking one, captured by Paul Hudak and David Turner, and a forward-looking one, which is constructed by looking at the prevailing programming problems and challenges in the early 1950s (before the development of the functional paradigm) and working forwards to see which steps and paths of investigation led to the modern notion of the functional programming paradigm. While Hudak and Turner look to Church's lambda calculus as the first functional programming language, the forward-looking history traces the paradigm's origin to Newell and Herbert Simon's theorem-proving machine, which gave rise to the list data structure, which in turn gave rise to Lisp.

After learning a bit about the functional programming paradigm, I went to 'A Brief History of Distributed Programming: RPC' by Caitie McCaffrey (Twitter) and Christopher Meiklejohn (Université catholique de Louvain) - one of the talks on my 'Most Anticipated' list. RPC, as I learned during the talk, stands for Remote Procedure Call and in a nutshell (which I am hopefully presenting here correctly) allows one computer to execute some functionality on a remote computer. Caitie and Christopher walked the audience through several iterations of the RPC technology, as captured in the Request for Comments memos published by various industry members. The problem with RPCs can be boiled down to two main design choices: (1) treat everything as local, or (2) treat everything as remote. RPC has re-emerged in the form of new frameworks, such as Finagle for the JVM, but none of these frameworks addresses the original issue ('everything local' versus 'everything remote') - they only solve the problem of getting things on and off the wire. Interesting future academic research in this area involves Lasp, Bloom and Spores.
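As a minimal concrete illustration of the idea - using Python's standard-library xmlrpc rather than any of the frameworks mentioned in the talk, with a port number chosen arbitrarily - a client can invoke a function that actually executes in the server process:

```python
import threading
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

def add(x, y):
    # This function runs in the server process.
    return x + y

server = SimpleXMLRPCServer(('localhost', 8765), logRequests=False)
server.register_function(add, 'add')
threading.Thread(target=server.serve_forever, daemon=True).start()

# The client calls add as if it were a local function - the
# 'everything looks local' illusion at the heart of the RPC debate.
client = ServerProxy('http://localhost:8765')
result = client.add(2, 3)
print(result)  # 5
server.shutdown()
```

The 'local illusion' breaks down exactly where the talk said it does: unlike a local call, `client.add` can fail with network errors or time out, and nothing in the call site warns you of that.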

Finally, I went to Neha Narula's (MIT Media Lab) talk on 'The End of Data Silos: Interoperability via Cryptocurrencies'. Her talk provided an excellent summary of the three ideas that come together to make blockchain technology possible: 1) distributed consensus, 2) private key cryptography and 3) common data formats and protocols. In distributed consensus, multiple computers come together to agree on a value. This system has to be tolerant of Byzantine failures - failures where certain nodes in the system start sending false or incorrect data. Existing Byzantine fault tolerance protocols require participants to be known ahead of time, which is not possible in a distributed system such as a cryptocurrency. This opens up the system to the Sybil attack, in which an attacker creates thousands of subverted nodes to influence the process of distributed consensus. What thwarts this effort in distributed cryptocurrencies is the computational price a new joiner in the blockchain has to pay. A Sybil attack is still possible, but would require a large amount of power to be expended by the attacker.
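That computational price is proof of work. A toy version (real Bitcoin hashes a block header with double SHA-256 at a vastly higher difficulty; the function and inputs here are my own) shows why producing a block is expensive while verifying it is cheap:

```python
import hashlib

def proof_of_work(data, difficulty=4):
    """Search for a nonce such that sha256(data + nonce) starts with
    `difficulty` hex zeros - costly to find, cheap to verify."""
    nonce = 0
    while True:
        digest = hashlib.sha256(f'{data}{nonce}'.encode()).hexdigest()
        if digest.startswith('0' * difficulty):
            return nonce, digest
        nonce += 1

nonce, digest = proof_of_work('hello block')
print(nonce, digest)
# Verification costs a single hash, no matter how long the search took:
assert hashlib.sha256(f'hello block{nonce}'.encode()).hexdigest() == digest
```

Each extra zero of difficulty multiplies the expected search time by 16, which is why flooding the network with thousands of Sybil nodes buys an attacker nothing unless each node can also pay the hashing cost.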