Hello and welcome to thoughtwisps! This is a personal collection of notes and thoughts on software engineering, machine
learning and the technology industry and community. For my professional website, please see
race-conditions.
Thank you for visiting!
20 Dec 2016
Notes on Day 2 of the Code Mesh 2016 conference in London.
Stateful stream processing
With the main keynote event moved to the afternoon, Day 2 of Code Mesh 2016 launched directly into the sessions. In the morning,
I attended Streaming, Database & Distributed Systems: Bridging the Divide by Ben Stopford from Confluent. It seems that distributed
systems are the hot computer science topic du jour and I, not wanting to be left out of the cool kids' crowd, headed over to listen and learn
from Ben's expertise in streaming and distributed systems. Ben outlined two goals for the talk: to explain stateful stream processing (SSP)
and to argue that SSP can be a 'general framework for building data-centric systems'. There are several flavours of data-analytics systems:
- database (e.g. Postgres): provides a consistent view of the data
- analytics database (e.g. Hadoop, Spark): specialises in aggregations performed over large datasets
- messaging system: has ephemeral state
- stream processing: manipulates concurrent streams of events and performs computations on the streamed data
- stateful stream processing (SSP): a branch of stream processing
If a database's query engine traverses a finite data table, the stream processor's query engine is designed to operate over an infinite dataset
that has to be bounded by a 'window'. The 'window' delineates how many 'ticks' of data are allowed into the stream processor's query engine at a time.
The query engine then executes a continuous query and emits data at some frequency. This behaviour is somewhat analogous to a materialised view
in a database: the DB takes two tables and a query from the user (some aggregation or grouping of the data) and manifests the result as another
table on disk. Materialised views are useful when performance is key, because the computation does not have to be repeated every time a user runs the query.
The materialised view table is recalculated every time data in either of the source tables changes.
How does this apply to stateful stream processing? In essence, stateful stream processing is about creating materialised views, which manifest as another table
or another stream.
SSPs typically use Kafka, a distributed log, to achieve statefulness (I am not entirely sure I understand exactly what Kafka does - I will be writing another post later to clarify some of the details).
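To make the 'materialised view over a stream' idea concrete, here is a toy sketch in plain Python (no Kafka involved) of a tumbling-window count that emits its view every time the window closes. The function name and the page-view data are my own inventions, purely for illustration:

```python
from collections import Counter

def tumbling_window_counts(events, window_size):
    """Toy stateful stream processor: counts keys over tumbling
    windows of `window_size` events and emits each window's
    'materialised view' as the window closes."""
    state = Counter()          # the materialised view under construction
    for i, key in enumerate(events, start=1):
        state[key] += 1        # update state on every tick
        if i % window_size == 0:
            yield dict(state)  # emit the view for this window
            state.clear()      # start the next window

# A stream of page-view events (hypothetical data):
stream = ["home", "about", "home", "home", "about", "pricing"]
for view in tumbling_window_counts(stream, window_size=3):
    print(view)
# {'home': 2, 'about': 1}
# {'home': 1, 'about': 1, 'pricing': 1}
```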
New trends in web dev and the effect of machine learning on software engineering
After Ben's talk, it was time to dive into some emerging technologies for web development. Laure Phillips spoke on 'How web programming is more than a server and some clients?'. She described the traditional approach of segregating a rich internet application into different tiers (i.e. backend, frontend, database layer) and programming each layer with a bespoke language. Then she gave some insights into her own research: developing tierless programming frameworks for internet applications.
After Laure's talk on web dev, it was time to examine how the pervasive presence of machine learning in modern data systems is affecting the field of software engineering. Twitter's Kovas Boguta spoke about "Machine Learning Models: A New Kind of Software Artifact". Kovas' talk highlighted the challenges that traditional software engineering tools face when confronted with non-deterministic models (how do you write tests for a machine learning model? how do you test it sufficiently?).
Flexible Paxos
This talk by Cambridge PhD student Heidi Howard was on my most-anticipated list, primarily because it provides a new angle on the work done by Leslie Lamport. Unfortunately, lots of procrastination on the night before the conference meant that I had failed to read Lamport's original Paxos paper and was thus more or less lost during the talk. I plan to publish another blog post soon with an in-depth overview of Lamport's original Paxos paper and Heidi Howard's Flexible Paxos modification.
05 Nov 2016
Code Mesh 2016 : Part 1
Thanks to the generous support of the Code Mesh scholarship program, I was able to attend the conference, meet interesting people working on non-mainstream technologies and, most
importantly, add about 100 new entries to my "Learn More About This" list. What follows is a summary I've constructed to remind myself of the talks and to provide a convenient
reference for future technical research projects.
Code Mesh 2016 Day 1
The first day of the conference kicked off with a keynote by Conor McBride on the topic of Space Monads. My brain - a newcomer to the world of functional programming -
was catapulted into the unfamiliar terrain of functors, monads, a programming language with peculiar syntax called Agda, and something that eventually compiled
to show two moving squares in a console window. What I did (sort of) understand is that Agda allows the programmer to specify the type of a program and then automatically generates a program of that type.
I am definitely adding space monads to my "Learn More About This" list, but I feel that I can only gain a solid understanding
of the things demonstrated in the keynote after gaining some ground in the functor-monoid-monad territory (Haskell to the rescue?).
After the keynote, the conference attendees separated into three tracks. I attended a talk on 'Gossiping Unikernels' by Andreas Garnaes (Zendesk). Andreas discussed rearchitecting a monolith
application (everything lives in the same process) into microservices hosted on unikernels. While the traditional 'server-side' stack involves multiple layers of virtualisation
(the application space, the container space, the virtual OS, the hypervisor), unikernels are able to interface directly with the hypervisor. Because unikernels incorporate less code than
traditional operating systems, they have a reduced attack surface and are thus (potentially) more secure. The 'potentially' in this paraphrase of Andreas' talk is my addition - mainly as a footnote, because
I am not entirely clear on what the drawbacks of reducing traditional operating system code are. Do unikernels lose some traditional OS security by including less library code?
Furthermore, Andreas discussed a problem that emerged once the monolith had been broken into microservices: service discovery - in other words, how do the microservices know who is providing what? Two solutions
are available: introduce a centralised messaging component, or investigate peer-to-peer solutions. The naive peer-to-peer implementation would be to regularly heartbeat to all members of the
microservices cluster, but this approach would take too much bandwidth, since the number of microservices that a component needs to ping to assert membership grows quadratically with the number of microservice
nodes. Thus there is a need for a distributed membership protocol that eventually detects a faulty node, is fairly fast and does not make high demands on the network. SWIM is such a protocol; it achieves its
aims by random heartbeating and gossiping. For example, in a three-node system with Alice, Bob and Charlie microservices, Alice pings Charlie to check for failure. If Charlie responds, all is good. If Charlie
does not respond within a particular timeframe, Alice pings Bob and asks Bob to ping Charlie. If Charlie responds to Bob, Bob relays this response to Alice. If not, Alice records Charlie as unavailable and
propagates this information to Bob (gossiping), who is then aware that Charlie is unavailable and in turn propagates this information to all other nodes in the system. Thus all nodes eventually recognize that
Charlie is no longer available.
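To make the ping/indirect-ping/gossip sequence concrete, here is a toy single-process sketch of one such round in Python. The class and method names are my own invention, not Andreas' implementation; real SWIM also picks a random probe target each protocol period and piggybacks gossip on ordinary messages rather than broadcasting it directly.

```python
class Node:
    def __init__(self, name, alive=True):
        self.name = name
        self.alive = alive
        self.suspects = set()   # peers this node believes are down

    def ping(self, target):
        """Direct probe: succeeds only if the target is up."""
        return target.alive

    def ping_req(self, helper, target):
        """Ask `helper` to probe `target` on our behalf."""
        return helper.ping(target)

def probe(prober, target, helpers):
    """One SWIM-style probe round from `prober`'s point of view."""
    if prober.ping(target):
        return  # target answered the direct ping, all is well
    # Indirect probe: ask the other members to try reaching the target
    if not any(prober.ping_req(h, target) for h in helpers):
        # Nobody could reach the target: record the failure and gossip it
        prober.suspects.add(target.name)
        for peer in helpers:
            peer.suspects.add(target.name)

alice, bob, charlie = Node("alice"), Node("bob"), Node("charlie")
charlie.alive = False
probe(alice, target=charlie, helpers=[bob])
print(alice.suspects, bob.suspects)  # {'charlie'} {'charlie'}
```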
After describing the theoretical aspects of SWIM, Andreas gave a demo of a pure implementation of the protocol (once again dipping toes into functional programming) and demonstrated how pure
implementations can make the development process easier by allowing different types of I/O interfaces (sync, async, mock, MirageOS) to wrap the core SWIM implementation on demand.
Finally, the talk concluded with a useful tip on how to apply property based testing to reduce the number of execution paths that have to be tested in a distributed system.
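I did not note down the exact example, but the general idea of property-based testing maps nicely onto distributed systems: instead of enumerating message delivery orders by hand, you let the test framework generate them. Here is a minimal sketch using the Python Hypothesis library - my own toy example, not the one from the talk:

```python
# pip install hypothesis
from hypothesis import given, strategies as st

def lww_merge(messages):
    """Last-write-wins register: keep the value with the highest
    timestamp, whatever order the messages arrive in."""
    state = (0, None)  # (timestamp, value)
    for ts, value in messages:
        if ts > state[0]:
            state = (ts, value)
    return state

MESSAGES = [(1, "a"), (2, "b"), (3, "c")]

@given(st.permutations(MESSAGES))
def test_delivery_order_does_not_matter(ordering):
    # Property: every delivery order converges to the same state,
    # so we need not test each execution path individually.
    assert lww_merge(ordering) == (3, "c")

test_delivery_order_does_not_matter()
```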
After Andreas' talk, I wandered into the main hall to hear Sophie Wilson speak about the 'Future of Microprocessors'. As someone who spends a big part of her day coming up with instructions for a
silicon brain to execute, I am ashamed to say that the inner workings of that silicon brain are a bit of a mystery. In spite of my inability to coherently define key terms such as microprocessor or
transistor, I had previously heard of Moore's Law ("the number of transistors we can fit on a silicon chip" - a lapse in the note-taking process - "doubles every 2 years"). The law is more
of an empirical observation, and the precise wording apparently drifts depending on the chip industry. Sophie showed the original microprocessor designs for the 6502 and the ARM1. All I could do
was marvel at the intricate photos and make mental notes to learn more about microprocessors ASAP. In her discussion of FirePath, Sophie demonstrated a set of instructions designed to be truly parallel
and remarked that very few modern software development languages are truly parallel and are thus a poor fit for the parallel hardware environment (a theme that was touched upon in Kevin Hammond's ParaFormance
talk later in the afternoon).
The number of transistors on a chip is rising, but the ability of devices to utilise this increasing capacity is limited by the cost of etching smaller gates and by the immense heat generated by these
power-dense microprocessors. Thus, even if more and more cores are added to future devices, some of these cores will have to be kept dark to keep the device from overheating.
The 10x performance increases witnessed in the early development of microprocessors will not be seen again.
In the afternoon, Mark Priestly spoke about the process behind the birth of new programming paradigms - with a specific slant toward the functional programming paradigm. Mark pointed out two ways to study the
'history of a programming paradigm': a backward-looking one, captured by Paul Hudak and David Turner, and a forward-looking one, which is constructed by looking at the prevailing programming
problems and challenges of the early 1950s (before the development of the functional paradigm) and working forwards to see which steps and paths of investigation led to the modern notion of the
functional programming paradigm. While Hudak and Turner look to Church's lambda calculus as the first functional programming language, the forward-looking history of the functional programming paradigm
traces its origin to Newell and Simon's theorem-proving machine, which gave rise to the list data structure, which in turn gave rise to Lisp.
After learning a bit about the functional programming paradigm, I went to 'A Brief History of Distributed Programming: RPC' by Caitie McCaffrey (Twitter) and Christopher Meiklejohn (Université catholique de Louvain) - one of the talks on my 'Most Anticipated' list. RPC, as I learned during the talk, stands
for Remote Procedure Call and, in a nutshell (which I hope I am presenting correctly here), allows one computer to execute some functionality on a remote computer. Caitie and Christopher walked the audience through
several iterations of the RPC technology, as captured in the Request for Comments memos published by various industry members. The problem with RPCs boils down to two main questions: (1) do we treat everything as local,
or (2) do we treat everything as remote? RPC has re-emerged in the form of new frameworks, such as Finagle for the JVM. None of these frameworks addresses the original question ('everything local' or 'everything remote'); they only solve
the problem of getting things on and off the wire. Interesting future academic research in this area involves Lasp, Bloom and Spores.
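For a concrete taste of the 'everything as local' illusion, here is a minimal RPC sketch using Python's standard library xmlrpc module - my own choice for illustration, since the talk itself discussed Finagle and the historical RFCs, not XML-RPC:

```python
# --- server process: exposes a local function to remote callers ---
from xmlrpc.server import SimpleXMLRPCServer

def add(a, b):
    return a + b

server = SimpleXMLRPCServer(("localhost", 8000))
server.register_function(add)
# server.serve_forever()  # blocks; run the server in its own process

# --- client process ---
import xmlrpc.client

proxy = xmlrpc.client.ServerProxy("http://localhost:8000/")
# The call below *looks* local but travels over the wire, so it can
# fail in ways a local call never would - the core RPC problem:
# print(proxy.add(2, 3))  # -> 5, if the server is up
```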
Finally, I went to Neha Narula's (MIT Media Lab) talk on 'The End of Data Silos: Interoperability via Cryptocurrencies'. Her talk provided an excellent summary of the three ideas that come together to make blockchain
technology possible: (1) distributed consensus, (2) private key cryptography and (3) common data formats and protocols. In distributed consensus, multiple computers come together to agree on a value. This system has to be
tolerant of Byzantine failures - failures where certain nodes in the system start sending false or incorrect data. Existing Byzantine fault tolerance protocols require the participants to be known ahead of time, which
is not possible in a distributed system such as a cryptocurrency. This opens the system up to the Sybil attack, in which an attacker creates thousands of subverted nodes to influence the process of distributed consensus.
What thwarts this effort in distributed cryptocurrencies is the computational price a new joiner to the blockchain has to pay. A Sybil attack is still possible, but it would require the attacker to expend a large
amount of power.
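As a toy illustration of that computational price, here is a sketch of hash-based proof of work in Python - a simplification of what Bitcoin actually does, with the difficulty and block data invented for the example:

```python
import hashlib
from itertools import count

def proof_of_work(block_data, difficulty=4):
    """Find a nonce such that sha256(data + nonce) starts with
    `difficulty` zero hex digits: cheap to verify, costly to find."""
    target = "0" * difficulty
    for nonce in count():
        digest = hashlib.sha256(f"{block_data}{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce, digest

nonce, digest = proof_of_work("block: alice pays bob 1 coin")
print(nonce, digest)  # finding this takes ~16**4 hashes on average
```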
23 Aug 2016
What can a technologist do about climate change?
It is 22:14. The 23rd of August, 2016. The days are starting to get shorter, the nights slightly longer.
It is 22:14 at night and it is hot.
Someone has aimed a hot hair dryer at London and kept it running all day.
Even as the sun set for the day, the exhausted, tired air of the city kept shimmering from the heat.
We are breaking records. In sports, in stadiums and in pools. And in statistics.
This July was the hottest month since recordkeeping began.
This Tuesday, the tally was 89 to 78. The residents of Shishmaref, an Inupiat community on the island of Sarichef off the coast of Alaska voted to move, because the land they had inhabited for centuries has become too unstable due to the effects of climate change.
If you are worried about this and wonder what you as a techie/software engineer/concerned technically-minded citizen can do, I encourage you to read Bret Victor’s “What can a technologist do about climate change?”.
22 Aug 2016
I am currently working through Leslie Lamport’s TLA+ Hyperbook and writing up these notes/summaries/review questions to hopefully help me internalise the material.
Notes on TLA+ Hyperbook by Leslie Lamport
What follows are my notes on Leslie Lamport’s book TLA+ Hyperbook.
Chapter 1: Introduction
What is concurrent computation? What is parallel computation?
In concurrent computation, things occur at the same time. In parallel computation, a single task is executed concurrently (that is, in chunks that occur at the same time). Parallelism is optional (unless sequential execution would be prohibitively slow) and is usually easier than concurrency, because the programmer is in control of which chunks of the single task are executed concurrently.
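A minimal sketch of that distinction in Python, assuming the 'single task' is summing a list: the programmer decides the chunking, and the chunks then run in parallel.

```python
from concurrent.futures import ProcessPoolExecutor

def partial_sum(chunk):
    return sum(chunk)

if __name__ == "__main__":
    numbers = list(range(1_000_000))
    # One task (summing a list), split by the programmer into chunks
    # that run in parallel - this control is what makes parallelism
    # easier than general concurrency.
    chunks = [numbers[i:i + 250_000] for i in range(0, len(numbers), 250_000)]
    with ProcessPoolExecutor() as pool:
        total = sum(pool.map(partial_sum, chunks))
    print(total == sum(numbers))  # True
```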
What is a digital system? How to choose a suitable abstraction for a digital system?
A digital system is a system that performs computation as a collection of discrete events. What constitutes a discrete event depends on the abstraction we are using to model the system and on who will be using that abstraction. Lamport gives the example of a digital calculator: for a user of the calculator, pressing the key 5 represents one event, while for the calculator engineer the act of pressing can be two events. The abstraction chosen should be as simple as possible while still modelling the system well.
What is the Standard Model?
A system is a collection of behaviours.
A behaviour is a sequence of states and represents an execution path of the system.
Each state is an assignment of values.
This is interesting! Can we come up with a "real world" example? Let's go with an apple. (To everyone who grows apples, I apologise - my knowledge of orchards is very poor!) An apple is a system with many different execution paths. It starts its life as a flower on the apple tree and from there can follow any number of execution paths:
- flower -> frost bite
- flower -> pollination -> ripped out by storm
- flower -> pollination -> unripened apple -> eaten by bird
- flower -> pollination -> unripened apple -> ripe apple
etc. I'm sure you can come up with many more execution paths for an apple.
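For what it's worth, here is a toy rendering of the Standard Model vocabulary in Python (my own illustration, not Lamport's notation): each state assigns values to variables, each behaviour is a sequence of states, and the system is the set of behaviours.

```python
# Each state assigns values to the system's variables; a behaviour is
# a sequence of states; the system is the collection of behaviours.
behaviours = [
    [{"stage": "flower"}, {"stage": "frostbitten"}],
    [{"stage": "flower"}, {"stage": "pollinated"}, {"stage": "storm-damaged"}],
    [{"stage": "flower"}, {"stage": "pollinated"},
     {"stage": "unripe"}, {"stage": "eaten-by-bird"}],
    [{"stage": "flower"}, {"stage": "pollinated"},
     {"stage": "unripe"}, {"stage": "ripe"}],
]

# Every behaviour of this system starts in the 'flower' state:
assert all(b[0] == {"stage": "flower"} for b in behaviours)
```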
A specification is a description of a model; a formal specification is a description written in a precisely defined language.
21 Aug 2016
On the Internet This Week - Week 33
Donald Knuth answers all kinds of questions from faculty and students, as well as questions submitted by internet viewers.
I always enjoy watching Don Knuth lecture or speak (it must have been amazing to attend Stanford and see Knuth lecture live!),
and this video is no exception. Some of the things that stood out for me from this recording:
- Literate programming is the greatest thing since sliced bread. Programs should be written for people, not for computers.
- Look up 'Selected Papers on Fun and Games' by Don Knuth.
- The hardest mathematical problem Knuth has resolved is called the 'Birth of a Giant Component in a Random Graph'. If you start with a graph of random vertices and begin randomly joining vertices together, a giant change happens when the number of connections you have added is approximately one half the number of vertices. By using ideas from complex analysis, it was possible to 'slow this down and watch it happen by measuring time in a different way' and thus to study this change. When Knuth goes into a problem, he tries to train his brain: the first week is baby steps, after a few weeks giant steps. All this happens by getting familiar with the problem domain. (A toy simulation of this transition is sketched below.)
- "There is no royal road to software, any more than there is a royal road to mathematics."
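Out of curiosity, here is a quick union-find simulation of that transition in Python. The vertex count and edge fractions are arbitrary choices of mine; this only eyeballs the phenomenon and does not reproduce Knuth's complex-analysis treatment.

```python
import random
from collections import Counter

def largest_component(n, num_edges, seed=0):
    """Drop `num_edges` random edges onto n isolated vertices and
    return the size of the largest connected component (union-find)."""
    random.seed(seed)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for _ in range(num_edges):
        # Join two random vertices (union of their components)
        parent[find(random.randrange(n))] = find(random.randrange(n))

    return Counter(find(v) for v in range(n)).most_common(1)[0][1]

n = 10_000
for edges in (int(0.4 * n), int(0.5 * n), int(0.6 * n)):
    print(edges, largest_component(n, edges))
# The largest component's size jumps sharply around n/2 edges.
```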
A great post (though slightly outdated - it uses Python 2.6 for its examples) about the various ways to synchronize threads in Python. I have very little experience with threading or synchronization and found the post very approachable.
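Since that post predates Python 3, here is the classic lost-update example with a threading.Lock in modern Python - a minimal sketch of the kind of synchronization primitive the post covers:

```python
import threading

counter = 0
lock = threading.Lock()

def increment(n):
    global counter
    for _ in range(n):
        with lock:          # without the lock, read-modify-write can
            counter += 1    # interleave and increments get lost

threads = [threading.Thread(target=increment, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 400000, reliably, because the lock serialises the update
```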