Lessons from QCon: Part 1 – Sympathy with the Machine

At my recent trip to QCon London I had the opportunity to hear some great talks given by some really good speakers. Often I found myself seriously torn by which session to attend in a given time slot because there was so much amazing content crammed into several tracks.The sessions I chose were broadly those which I felt had potential for direct bearing on my current job, either in terms of the nature of our team or the kind of project we’re working on.
Two sessions that I attended for different reasons wound up resonating with me on a single common theme.

Mechanical Sympathy

This was the term used by one of the talks which was explicitly talking about this subject. The title of the talk was ‘Lock-Free Algorithms’ (Given by Martin Thompson aka: @mjpt777 & Mike Baker aka: @mikeb2701) but whilst the other talk ‘Scalable Internet Architectures’ (Given by Theo Schlossnagle aka:@postwait ) did not explicitly use this term, several of the things described hinted at the same basic idea.

That idea being that we, as software engineers, far to often totally ignore the nature of the real systems our software is running on, along our convenient high level abstractions to shield us from thinking about memory, or cpu cycles, or network packets. And in our complacency and general laziness we throw away all the hardware advances of the last 15 years by doing things which are horrifically inefficient.  Simply because we don’t stop to give it some though.

In the talk on lock-free algorithms, they showed how taking the time to construct a simple message queue between a single producer and consumer, which requires no locks could make things more efficient. But more than that, taking some time to consider the nature of the machine, the realities of exactly how a CPU handles memory, what bits inside a CPU are capable of what kinds of calculations, it is possible to make immense gains in performance (or perhaps avoid immense losses)
In their example they highlighted a couple of specific considerations. The first being that the default algorithm required a divide operation to calculate the position in the queue to use. Divide is a fairly expensive operation for processors. By making their queue length a power of 2, they could make use of the fact machines work in binary to construct a mask that allowed a much simpler operation to arrive at the same result.
Secondly, having given a whistle-stop tour of how CPU’s work, they talked about incidental sharing, where two variables, used together in their program would likely be allocated into memory next to each other. Each variable (both pointers) where relatively small structures, a 32bit int (4 bytes). However CPU’s access memory in 64Byte rows. this is the smallest unit they move, and when they access it, they effectively lock the use of everything in that row.
This means that on other core of your CPU can access any of the data stored in that row, and when it needs to it must not just wait for it to become free, but go fetch the latest version from where ever it was, potentially moving the data via slower memory in order to move it between cores or even processors. Obviously if the two variables are both used by the same thread that’s a good thing, but here, were they two variables are specifically used by the two parts of reading from or writing to the queue, this nature is very bad. By padding the size of each variable, they were able to ensure that they would be allocated into separate rows, and thus available for multiple cores to access simultaneously.

They measured their performance in terms of the number of million operations per second the algorithm could sustain. According to their numbers, the basic Java object for achieving a similar function is capable of 5Million ops per second, with latencies spiking at up to 32,000ns depending on the situation.
Their basic lock-free algorithm was similar to an equivalent lock free option in Java, capable of 15m ops per second, but with latencies down at a steady 180ns.
Then introducing their mechanical sympathy…they got up to 85m ops per second. Yes, I said 85 Million operations per second. So we’re not talking about ‘a little bit better’ we’re talking nearly 8 times faster.
Under the hood they showed that the number of operations sent to the cpu was broadly similar, about 4500 operations, however without the machine sympathy, that translated to about 63Billion cpu cycles, with the machine sympathies this came down to a little under 8Billion. This means the ratio of instructions to cycles was horrifically inefficient before, and all they did was make each clock cycle count more.

I’m a sucker for performance improvements, and this was a pretty dramatic example of how taking a little care can yield incredible results. (or being a bit lazy can cost you dearly)

This theme was picked up on in the scalable architectures talk, in particular in an example of considering the way data moves over networks. It doesn’t just move in streams of bytes, it moves in packets, and packets have a maximum size. He gave an example of checking the size of basic icons on a website, which might be incidentally just slightly larger than 1 packet. Ignoring this means you require 2 packets to deliver that icon. But perhaps you can squeeze it down a little, and HALF the cost to your network of serving that icon. We’re not talking about great leaps of engineering here, just caring enough to both doing a little maths, and understanding the impact of your decisions.

Theo discussed a number of factors that he felt the creators of software should know. That it is part of their job to know. It is too common to have software developers think the job is done when the software compiles into a binary. But that is just the very beginning of its life. We have an obligation to produce *operable* software that is actually fit for purpose, and a big part of that is understanding what its impact on the system it runs on will be.
When a customer hits a website, how much traffic will move over your network? how many calls to the back end will be made? How many database calls will that cause? what disk I/O? Are you monitoring all of these factors? Do you know what you system looks like when it’s healthy? If not how the hell do you hope to understand what went wrong when it fails? (and it will fail)

These two talks were both pretty different in terms of style, one being a very specific look at code examples and their implications, the other a high level list of things that you should understand about your architecture. But both left me with a powerful feeling that I don’t understand anything near enough about the details of the software we produce and I had better start finding out.

Throughout Theo’s presentation he provided us with rules, which I shan’t reproduce here except rule #10: Don’t be a F***ing Idiot (idiocy is bad)
What he meant by this was that it is really not that hard to run the numbers, to understand your system properly, and it’s not that hard to apply some logical thinking to the results of those calculations, in particular in questioning those which don’t sound right. Double check your working. Idiocy is contagious, if you stop caring to check things so will others and that is not a good culture to slip into.

I think these two presentations were my favourite of the 2 days of QCon I attended, and I thank the speakers for sharing their experience is such an accessible way. Now for the hard part, taking all my notes and trying to apply them to my own team and project. and try not to make angry with me.