Human Error in Software (soveran.com)
86 points by antirez on Dec 2, 2015 | hide | past | favorite | 12 comments


I really loved this post, and experienced firsthand what Michel says in the context of Redis. The Redis code base is in general not rocket science: it is pretty understandable and does not use overly complex ideas. I simply don't trust my ability to write code that is both complex and correct, and so far this approach has worked well; Redis is quite stable, all considered.

However, the file implementing replication, replication.c, gained complexity very fast over the years compared to the other parts of Redis. It's 2500 lines of code, while for example cluster.c is 5000 lines of code, but even at 2x the size, cluster.c is simpler to understand. The reason is exactly what Michel says: code that matches a mental model.

How replication.c got its complexity is easily explained. Redis replication was extremely simple at the start: just streaming replication between master and slave(s). You connect, get the initial payload representing the dataset on the master, and from there on it is just streaming of the write commands received by the master (from the clients) to the connected slaves. Then we added partial resynchronization. Later we added in-memory replication (called diskless replication) for environments with slow disks, chained replication, and so forth.

Instead of redesigning the code to cope with the new complexity, I usually approach the issue in a different way: I modify what I have with the minimal set of changes needed to make it work. In my opinion this is usually a good strategy, since making something truly general sometimes makes it more complex than having something simpler with a few exceptions to handle corner cases. However, this approach does not scale well: eventually the code is structured as if the problem were simpler than it is, but with a lot of exceptions. Once you reach this point, you can no longer build a mental model of how the code works.
The result was that we (the Redis Labs Redis core team and I) recently found a number of bugs in corner cases that could arise when mixing diskless replication with PSYNC and other unexpected events. While fixing those bugs I started to refactor replication.c to structure it in a way that makes it possible to build a mental model of it again, so that the actual layout of the code reflects a bit more what the moving pieces are: creating the initial synchronization payload, a slave attempting a partial resynchronization, and so forth. There is still work to do, but in general it's very important to write code for which a simple mental model exists and matches the code layout.
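As a rough illustration of the flow described above — an initial full sync, then streaming, with partial resynchronization bolted on later — here is a tiny sketch in Python. All the names and structure are hypothetical, chosen for clarity; this is not how replication.c is actually written, only a model of the moving pieces (full sync payload, command stream, replication backlog, PSYNC-style resume):

```python
# Hypothetical sketch of master/replica streaming replication with
# partial resync. Illustrative only; not taken from replication.c.

class Master:
    def __init__(self, backlog_size=4):
        self.dataset = {}
        self.offset = 0          # total number of writes ever streamed
        self.backlog = []        # last N write commands, for partial resync
        self.backlog_size = backlog_size

    def write(self, key, value):
        self.dataset[key] = value
        self.backlog.append((key, value))
        self.backlog = self.backlog[-self.backlog_size:]
        self.offset += 1

    def backlog_since(self, offset):
        missed = self.offset - offset
        if missed > len(self.backlog):
            return None          # replica fell too far behind the backlog
        return self.backlog[len(self.backlog) - missed:]


class Replica:
    def __init__(self):
        self.dataset = {}
        self.offset = 0          # how much of the master's stream we applied

    def full_sync(self, master):
        # Initial payload: a snapshot of the master's dataset.
        self.dataset = dict(master.dataset)
        self.offset = master.offset

    def partial_sync(self, master):
        # PSYNC-style resume: replay only the commands we missed,
        # falling back to a full sync if the backlog no longer covers us.
        missed = master.backlog_since(self.offset)
        if missed is None:
            self.full_sync(master)
            return
        for key, value in missed:
            self.dataset[key] = value
            self.offset += 1
```

Even in this toy form you can see why the exceptions pile up: partial resync already has two outcomes (replay vs. fall back to full sync), and each later feature (diskless transfer, chaining) multiplies the cases unless the code is restructured around them.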


Through the years, Redis has become and stayed my favorite database technology. Simple, lightning-fast, with plenty of functionality to do anything you need. This post is off topic with respect to the OP, but I just wanted to take a minute to say thanks.


I made the decision to use Redis/Ohm and Cuba for a personal project several months ago and I've never felt more connected to and in command of what I'm doing. I owe both these gentlemen a debt of gratitude.


Thank you fapjacks, much appreciated comment.


While I like James Reason's book and I find its discussion of mental models instructive in my code and in my life, I do have to break with him on his conclusions, which chiefly amount to the takeaway that systems at the level of complexity we're building today are impossible to run safely, and we should shut them all down. (I exaggerate, but only a little.)

A book I like which responds to Reason's is Nancy Leveson's Engineering a Safer World (free PDF from the MIT Press, even! https://mitpress.mit.edu/books/engineering-safer-world) which says, okay, if we don't want to shut these systems (like the Internet) down, how can we run them safely, and provides some guidance.

I gave a short talk about it at Facebook's most recent Security@Scale conference in Boston a couple weeks ago, the video of which is here: https://www.youtube.com/watch?v=e_-n5wX8okQ

Edit: To tie this back to the OP, I think while it's desirable for software to be as simple as is reasonable given constraints, it's decreasingly possible to say that all software can be built so simply that analytic reduction works, and we need tools to help us cope with software systems which exhibit emergent complexity.


"the rational behavior would be to read the code, understand what it does, and reject it if it doesn't work for their use case."

This just isn't practical in most cases. I just don't have time to read the source of every tool I use. No matter how much we strive for simplicity, the fact of the matter is that nearly any software system that is useful these days is going to be too big for every developer who uses it to read its source. You have to accept that in a lot of cases, you're going to need to use tools whose inner workings are obscure to you.

Simplicity in programming is great, but we passed the point of understanding all the software we used a LONG time ago.


The inner workings may be obscure, but I really like it when the outer workings are not. I use the Google Datastore (although Postgres would work here too), which I'm sure is super complicated internally. Externally though, it has certain properties that form a fairly simple mental model.

With that mental model, you can predict from reading some code what the possible error cases or race conditions could be, or what the state of the datastore entity would be after running some code against it. Perhaps that's not precisely "Software Complexity", but "Library Complexity" instead.
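To make the point concrete: one such external property is optimistic transactions, where the only failure mode you need to reason about is a commit conflict. Below is a toy model of that mental model in Python — hypothetical names throughout, and not the actual Datastore (or Postgres) API; it only shows how a simple external contract lets you predict the race conditions from reading the code:

```python
# Toy optimistic-concurrency store. Illustrative mental model only;
# not the Google Datastore API. Each key carries a version counter,
# and a commit succeeds only if nobody wrote the key since we read it.

class Store:
    def __init__(self):
        self.data = {}
        self.version = {}

    def get(self, key):
        return self.data.get(key, 0), self.version.get(key, 0)

    def commit(self, key, value, expected_version):
        if self.version.get(key, 0) != expected_version:
            return False         # conflict: caller must retry
        self.data[key] = value
        self.version[key] = expected_version + 1
        return True


def increment(store, key):
    # Read-modify-write with retry. The mental model tells us the only
    # possible race is a version conflict, and the loop handles it.
    while True:
        value, version = store.get(key)
        if store.commit(key, value + 1, version):
            return
```

With this model you can answer "what happens if two clients increment concurrently?" without reading the store's internals: one commit loses and retries, so no update is lost.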

I was going to use Redis as an example, but the internal workings are probably too easy to understand.


I concur. In addition to not having the time to double-check everything I run, I don't have a choice of what tool to use in many cases. Either there is no other tool, my employer (or customer, or vendor) has dictated that I need to use a particular tool, or other tools are too expensive/run on a different system/whatever.


Yes. But it is reassuring when a dependency facilitates a relatively painless deep dive into its guts.

Not sure if and how one could know this beforehand, except for taking advice from random strangers in places like this.


100% agree with this, particularly on the part emphasized:

"An advanced programmer can create a program that is correct, but complex and hard to understand. For the purpose of creating an accurate mental model, *even the program's correctness is of secondary importance: code that is understandable can be fixed.*"


That's the old Kernighan quote[1]:

"Everyone knows that debugging is twice as hard as writing a program in the first place. So if you're as clever as you can be when you write it, how will you ever debug it?"

[1] https://en.wikiquote.org/wiki/Brian_Kernighan


I think Go is optimized around this idea: that the average reader will have an accurate mental model. (This of course implies trade-offs with other considerations.)



