thoughtwisps One commit at a time

Code Mesh Day 2

Notes on Day 2 of the Code Mesh 2016 conference in London.

Stateful stream processing

With the main keynote moved to the afternoon, Day 2 of Code Mesh 2016 launched directly into the sessions. In the morning, I attended ‘Streaming, Database & Distributed Systems: Bridging the Divide’ by Ben Stopford from Confluent. Distributed systems seem to be the hot computer science topic du jour, and I, not wanting to be left out of the cool kids crowd, headed over to learn from Ben’s expertise in streaming and distributed systems. Ben outlined two goals for the talk: to understand stateful stream processing (SSP) and to argue that SSP can be a ‘general framework for building data-centric systems’. There are several flavours of data-analytics systems:

  • database (e.g. Postgres: provides a consistent view of the data)
  • analytics database (e.g. Hadoop, Spark: specialises in aggregations performed over large datasets)
  • messaging system (has ephemeral state)
  • stream processing (manipulates concurrent streams of events and performs computations on the streamed data)
  • stateful stream processing (a branch of stream processing)

Whereas a database’s query engine traverses a finite data table, a stream processor’s query engine is designed to operate over an infinite dataset, which has to be bounded by a ‘window’ before a query can run over it. The ‘window’ delineates how many ‘ticks’ of data are allowed into the stream processor’s query engine. The query engine then executes a continuous query and emits data at some frequency. This behaviour is somewhat analogous to a materialised view in a database. The database takes two tables and a query from the user (some aggregation or grouping of the data) and manifests the result as another table on disk. Materialised views are useful when performance is key, because the computation does not have to be repeated every time a user runs the query. The materialised view table is recalculated every time data in either of the tables changes.

How does this apply to stateful stream processing? In essence, stateful stream processing is about creating materialised views, which manifest as another table or another stream. SSPs typically use Kafka, a distributed log, to achieve statefulness (I am not entirely sure I understand what exactly Kafka does; I will be doing another post later to clarify some details).
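To make the window idea a little more concrete, here is a minimal Python sketch. It is not taken from Ben’s talk and does not use Kafka or any real stream processing library; the function name `tumbling_window_sums`, the event layout, and the fixed-size (‘tumbling’) window are all assumptions chosen for illustration. It treats a possibly endless iterator of events as the stream, groups events into windows by timestamp, and emits a small per-key sum (a tiny ‘materialised view’) each time a window closes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical event layout: (timestamp in seconds, key, value)
Event = Tuple[int, str, float]


def tumbling_window_sums(
    events: Iterable[Event], window_size: int
) -> Iterator[Tuple[int, Dict[str, float]]]:
    """Yield (window_start, {key: sum}) each time a window closes.

    Assumes events arrive in timestamp order; late or out-of-order
    events are not handled in this sketch.
    """
    current_window = None
    view: Dict[str, float] = defaultdict(float)
    for timestamp, key, value in events:
        window_start = (timestamp // window_size) * window_size
        if current_window is None:
            current_window = window_start
        if window_start != current_window:
            # The previous window is complete: emit its view and start afresh.
            yield current_window, dict(view)
            view = defaultdict(float)
            current_window = window_start
        # The "continuous query": keep a running sum per key.
        view[key] += value
    if current_window is not None:
        # The stream ended (only possible for a finite test input).
        yield current_window, dict(view)


events = [(0, "a", 1.0), (3, "b", 2.0), (7, "a", 4.0), (12, "a", 5.0), (14, "b", 1.0)]
results = list(tumbling_window_sums(events, window_size=10))
for window_start, sums in results:
    print(window_start, sums)
```

With a window of 10 seconds, the first three events fall into the window starting at 0 and the last two into the window starting at 10, so two views are emitted. A real stream processor would also have to persist this running state somewhere durable, which is presumably where a distributed log comes in.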

After Ben’s talk, it was time to dive into some emerging technologies for web development. Laure Phillips spoke on ‘How web programming is more than a server and some clients?’ She described the traditional approach of splitting a rich internet application into separate tiers (i.e. backend, frontend, database layer) and programming each tier in its own language. Then she gave some insights into her own research: developing tierless programming frameworks for internet applications.

After Laure’s talk on web dev, it was time to examine how the pervasive presence of machine learning in most modern data systems is affecting the field of software engineering. Twitter’s Kovas Boguta spoke about “Machine Learning Models: A New Kind of Software Artifact”. Kovas’ talk highlighted the challenges that traditional software engineering tools face when confronted with non-deterministic models (how do you write tests for a machine learning model? how do you test it sufficiently?).

Flexible Paxos

This talk by Cambridge PhD student Heidi Howard was on my most-anticipated list, primarily because it provides a new angle on the work done by Leslie Lamport. Unfortunately, lots of procrastination the night before the conference meant that I had not read Lamport’s original Paxos paper, and so I was more or less lost during the talk. I plan to publish another blog post soon which will give an in-depth overview of Lamport’s original Paxos paper and Heidi Howard’s Flexible Paxos modification.