Had the opportunity to read this one on a flight recently. It was a quick read and not a bad one, though it took a fair amount of narrative to get to the core points.
Following in Eliyahu Goldratt's footsteps of writing bad fiction to demonstrate business realities, Zwieback takes us through the story of a corporate IT department coming to terms with the fact that it was more often than not looking for a scapegoat rather than attempting to determine the real weaknesses in the way it was managing its tech. The early story is familiar to anybody in the tech world: something goes wrong, and the last person to touch it — even though he had done everything he should have done — gets fired because “somebody has to be held accountable.”
Complexity and Chaos
There are some interesting bits of information throughout the book. For one, it pointed me to A Leader’s Framework for Decision Making and the Cynefin Framework, which neatly encapsulates a lot of what many of us instinctively know about our environments but may have trouble explaining. They note that virtually all decision making takes place in one of four environments:
- Simple, in which a response to a problem is to sense, categorize (based on known rules) and respond;
- Complicated, in which a response is to sense, analyze and respond, which favors expertise in the workings of the system and the ability to work back to a root cause;
- Complex, in which the problem requires one to probe (in order to gather additional information), sense, then respond. The primary difference between this environment and the prior one is that cause-effect relationships can only be known in retrospect; as such, solutions favor creativity and a willingness to experiment safely.
- Chaotic, in which no manageable patterns exist and the priority is a crisis-management approach to restore the situation — a bit at a time if necessary — to a merely complex one in which probing and responses are possible.
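The four environments and their response patterns can be summarized in a small sketch. This is purely illustrative — the names and data structure are my own, not from the book or the original Cynefin article — but the step sequences match the framework as described above:

```python
from enum import Enum

class Domain(Enum):
    """The four Cynefin decision-making environments."""
    SIMPLE = "simple"
    COMPLICATED = "complicated"
    COMPLEX = "complex"
    CHAOTIC = "chaotic"

# Ordered response steps for each environment. Note the key shift at
# COMPLEX: you must probe first, because cause and effect can only be
# known in retrospect. In CHAOTIC, you act first to stabilize, aiming
# to restore a merely complex situation where probing is possible again.
RESPONSE = {
    Domain.SIMPLE:      ["sense", "categorize", "respond"],
    Domain.COMPLICATED: ["sense", "analyze", "respond"],
    Domain.COMPLEX:     ["probe", "sense", "respond"],
    Domain.CHAOTIC:     ["act", "sense", "respond"],
}
```

The practical takeaway is the COMPLEX row: when cause-effect relationships aren't knowable up front, analysis alone can't get you to a "root cause," and safe experimentation has to come first.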
Zwieback also correctly notes that in the IT world today, virtually all environments are complex or chaotic, with so many moving parts that discerning direct cause and effect is in most cases not possible at the time of a crisis. This is a theme that is explored in much more depth in Samuel Arbesman’s Overcomplicated: Technology at the Limits of Comprehension, which is also a good read. He also notes that most managers aren’t particularly well trained at dealing with such environments, as management education focuses primarily on cause-and-effect situations.
The “root cause”
Zwieback takes us through a narrative in which our hero characters slowly come to the understanding that in complex and chaotic systems, the root cause is always the same: impermanence. Given enough changes in such an environment (which is normal in complex IT systems), it is inevitable that some unexpected, untested, undocumented combination of changes to components and business circumstances will occasionally produce unexpected results. People who are doing their work to the best of their ability, and in fact doing the right things by applying security patches, system updates and software improvements that support the business, can inadvertently introduce such combinations of changes even as they do the “right thing.” As such, “human error” is a symptom of those environments.
He also addresses cognitive biases, including hindsight bias (judging people based on what we know now, not what was known at the time), outcome bias (judging people based on the outcome of the decision, not on whether they did the right thing given the information available to them at the time), and fundamental attribution error (the belief that people’s actions are entirely their own choice, never shaped by outside pressures or influences). In doing so, he points to Kahneman’s work.
In the narrative, he briefly addresses, then glides past the observation that the most qualified people are paradoxically the ones who are often blamed for failure. I wish there had been more time spent on this. Instead, it’s lurking beneath the surface in his discussion of the victim — a network engineer named Mike. I first encountered this kind of thinking from a relative commenting on the two surgeons who had been recommended to my father for his heart surgery many years ago. Both of them, the relative complained, had pretty poor records of patient survival.
The cardiologist had to set us straight: they were the two best surgeons in New York — the one who ultimately operated on my father had done bypass surgery on Bill Clinton just a week before — and the reason they had relatively poor rates was that they were the ones who took on the most difficult and complicated cases. When adjusted for this, their results were at the top of the pack. The same is true in IT. The person you call in when you can’t figure out what is going on is also the person most likely to have their hands on the keyboard when everything screeches to a halt. The one who has never had anything fail catastrophically is the newbie you hired last week who hasn’t yet had the opportunity.
Zwieback concludes the book with our organization coming up with a full template for a Learning Review after each incident, a process designed not to assign blame and punish people (except in obvious cases of gross negligence or sabotage), but rather to understand how qualified people trying to solve a problem ended up making wrong decisions, and to fix the system so that such circumstances do not recur. What he describes will be familiar to anybody who has studied how Amazon, Netflix or a number of other top tech companies do this, and it is often referred to as a “COE” or “Correction Of Errors” process.
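To make the shape of such a review concrete, here is a minimal sketch of what a blameless post-incident record might capture. The field names are my own illustrative choices, not Zwieback's actual template or any company's real COE form; the point is what gets recorded (timeline, contributing factors, systemic fixes) and what is deliberately absent (a person to blame):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class LearningReview:
    """Illustrative blameless post-incident review record.

    Note there is no "responsible person" field: the output is a set
    of systemic corrective actions, not a name to punish.
    """
    incident_summary: str
    customer_impact: str
    timeline: List[str] = field(default_factory=list)              # what happened, and when
    contributing_factors: List[str] = field(default_factory=list)  # plural: never a single "root cause"
    what_went_well: List[str] = field(default_factory=list)        # detection, mitigation, communication
    corrective_actions: List[str] = field(default_factory=list)    # fix the system, not the person
```

A review might then read like: contributing factors "untested combination of a security patch and a config change," corrective action "add a staging environment that exercises that combination" — rather than "Mike pushed the change."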
A decent read
There’s no denying that as fiction it isn’t very compelling. The characters are fairly flat and two-dimensional, and there isn’t much tension or complexity to the plot. But it’s a quick read, and the narrative approach makes the points quite well even if you have a pretty good idea of where the author is going right from the start. He manages to put in a decent amount of useful information about managing complex environments and why it’s different from simple or merely complicated ones, and adds in a fair amount of behavioral psychology to further make the point. You should certainly read some of the other works he points to (that I’ve linked), but as an introduction to how to deal with and prevent failures, it’s pretty good and a nice choice to give your CEO the next time he asks you about that latest network outage nobody can really explain.