Writing a postmortem: an interview exercise I really like

2017-10-31

First, some exciting news that’s relevant as context: starting in November, I’ll be joining Mapbox in their Washington, DC office. One thing that made me excited about joining Mapbox was how much I enjoyed their interview process. No part of it felt like it required extra “preparation” and it was clear that some significant thought had been put into its design. In my opinion, this is a pretty useful signal about an engineering organization. If we accept that finding great engineers (by whatever definition of “great” you subscribe to) is one of the largest contributors to the success of any company, it’s possible that a company with an interview process that appears disorganized or nonsensical is otherwise well-functioning, but it’s probably at least somewhat less likely.

One aspect of the Mapbox interview process that I particularly liked was their request that I write a blameless postmortem as a take-home exercise. If the term is unfamiliar, postmortems are a tool designed to help facilitate a culture of building institutional knowledge and learning from the past. Whenever something goes wrong – an outage, a bug in production, a failure to meet an SLA, etc. – anybody involved in the situation can call for a postmortem. The postmortem takes the form of a shared document where everybody can contribute their account of the incident to help identify its ultimate causes and propose changes to prevent it from happening again. The “blameless” aspect is crucial: a good postmortem avoids conclusions like “Dan wrote a bug and it brought down our service” and instead says “Dan wrote a bug and it brought down the service: we need to improve our testing and deployment processes to make sure that they catch this category of bugs in the future.” For many mistakes that initially look like they ought to be blamed on an individual, it’s possible to identify a deficiency in a process as the root cause. You can read more about blameless postmortems in the Google SRE book.

I thought this exercise was a great interview question because it lets the reader learn more about several different attributes that one might associate with good programmers:

It’s a sample of the type of written communication you’ll be doing frequently in a work environment. Between documentation, taking notes while working, design documents, commit messages, chat, and email, programmers can produce significantly more prose than code. Being able to express technical concepts clearly, unambiguously, and succinctly is essential. I may be wrong, but I sometimes get the feeling that this is widely acknowledged, but without the accompanying recognition that writing is a skill that can be improved relatively easily. Producing excellent writing might take a lot of work, but learning to write decently isn’t that different from improving as a programmer: it takes building up experience, practicing deliberately, taking the time to revise repeatedly, and getting constructive feedback.
It’s a great way for the writer to demonstrate their ability to assume the perspective of others. So much of being a good engineer is having empathy for your users and the other programmers who will read, modify, or interact with your code. Writing a useful postmortem necessarily requires empathy, because you have to put yourself in the shoes of everybody involved in the incident and understand what they were thinking and why they took the actions they did.
It helps assess the writer’s ability to think critically and logically about a complex chain of events. Imagining all possible contributing factors to an incident and identifying the various links of causality is the same type of thinking as is required when trying to think of edge cases or identify the tradeoffs and compromises in a system’s design. Finding ultimate rather than proximate causes, dismissing alternate explanations, and considering counterfactuals are all closely related to the type of critical thinking involved in debugging software or troubleshooting systems.

When I sat down to write my postmortem, I wasn’t sure whether to analyze a programming-related incident or choose something else. The instructions said the postmortem could be about anything I liked, and I decided it might be fun to write about a certain eventful and unusual accident that happened to me a few years ago. What I wrote is definitely not perfect, it’s just one example, and there’s no one “right” way to format or organize a postmortem. Anyway, I’m excited about the new job, and feel free to reach out if you’re curious about the Mapbox hiring process or what it’s like to work there (you should probably give me some time after I’ve started for the latter). Here’s what I sent the hiring team, unedited, in full:

Background

I chose this incident by thinking, “what’s the most memorable unfortunate thing that I’ve been involved in over the past few years?” For context, in 2014 I had purchased an old, smallish (28 foot) sailboat. At the time I had some extra savings, and decided that it would be an enjoyable way to pass the summertime in Boston. I stored the boat on a mooring near Boston Harbor and sailed it frequently in 2014 and 2015.

One weekend in October 2015, I decided to sail to nearby Cohasset Harbor by myself, with plans to anchor there and spend the night onboard. Sailing alone entails a higher workload, as well as having no assistance if something goes wrong. However, I knew the boat well, and I had been sailing on my own frequently that summer. In the morning, I checked the weather forecast carefully. Although the breeze was forecast to increase the following day, everything looked well within the limits of my comfort level and ability. The journey to Cohasset was peaceful and beautiful, with light wind and plenty of sunshine.

That night, I slept somewhat fitfully. Sleeping at anchor is generally nerve-wracking; there’s a constantly lingering worry of being woken by the boat bumping into something because the anchor has come loose. It also takes some adjustment to sleep with the motion of the boat when you’re used to a bed on firmer ground. I woke up to a gentle rain as the sun came up, which gave way to a gorgeously thick bank of fog. I spent the morning reading a book and dozing, waiting for the fog to lift. After eating lunch, it was time to head home. The wind had started to build a little, and a quick check of the forecast told me it was going to continue to strengthen. However, the journey back was only two hours, the wind direction was right, and most of it was in sheltered waters.

The incident

About half of the way home, the wind had built significantly. It wasn’t strong enough to make me feel in danger, but the boat was going at its maximum speed and required constant attention. Suddenly, an especially strong gust hit, and I heard a clank near the mast. I saw that a shroud, one of the metal cables that connect the mast to the deck and hold it upright, had detached and fallen. Thinking back, I remember feeling more surprised than afraid. After all, I thought, the mast was still secured in several other places. However, I was quite close to shore, and I knew I would soon need to turn to avoid an area of shallows. Then, the wind started to change direction, causing me to panic and make the turn. This was a critical mistake. The change of wind angle altered the forces on the mast, and, in what felt like slow motion, the mast snapped in two at its middle and fell overboard.

Aftermath and response

After recovering, I realized that I needed to take action immediately. I ran to start the gasoline engine, hoping to move to somewhere more sheltered to anchor and buy myself more time to solve the problem. However, if any of the wreckage tangled in the propeller, I would be in deep trouble. Luckily, the engine worked. Fighting the wind and current, the boat moved along at barely one mile per hour. After a painfully long 45 minutes, I finally managed to put the anchor down. A group of Good Samaritans on another boat saw what had happened and offered their help. Together, we pulled the broken mast and sails out of the water and tied them to the deck. Eventually, I used the engine to travel the last half hour back to my mooring.

Ultimate causes

There were multiple root causes of this incident. Outside of extreme weather conditions, a shroud detaching from the mast is a very low-probability event. The cables themselves are designed to withstand forces far beyond their normal working load. They are secured at both ends by thick steel pins, which only break when severely corroded. Unfortunately, it is difficult to visually inspect the end that attaches to the mast. Most sailors check them when they can, often during the winter when the boat is on land for storage. The rigging on the boat had been completely redone by a reputable contractor in 2013, and visually appeared to be in perfect shape. As a result, I hadn’t had it inspected since I bought the boat. It is probable that, either due to bad luck or error on the part of the contractor, the steel pin managed to wiggle itself loose. Vibration from the gusty winds probably accelerated the process to completion. The entire incident could most likely have been prevented by having the rigging inspected when I purchased the boat.

Although the shroud coming loose was a serious failure, it was not by itself enough to bring down the mast. Appropriate and rapid action on my part would have preserved it. If I had quickly dropped the sails, the pressure on the mast would have eased, and I could have used the engine to return home. Although I had a significant amount of sailing experience, much of it was in smaller boats, and relatively little of it was alone. I did not have enough practice dealing with serious equipment failures at sea on my own, or generally in making decisions under time pressure with significant consequences. In my initial reaction, I failed to recognize how severe the incident was, and I did not realize that I had more time to assess the situation than I thought. A contributing factor was a lack of sleep from the previous night, which impaired my ability to think clearly.

Analysis and prevention

The first question, and probably the factor I had the most control over: should I have been out there on my own in those conditions? Given that I was alone, should I have waited for the wind to abate? I’ve thought about it many times, and I think that my decision was ultimately the correct one. You don’t learn or improve without pushing yourself towards the edge of your comfort zone, and one reason I purchased the boat was to improve as a sailor. Also, there are always numerous low-probability things that can go wrong. Being cautious and preparing adequately only serves to lower those probabilities, not eliminate them entirely.

I learned several valuable lessons from this incident. When facing situations where you depend on your equipment, it’s worth spending the time and whatever resources you have to ensure that all of it is in peak condition. A few hundred dollars spent on a rig inspection could have prevented the entire incident in the first place. Also, I’ve heard repeatedly that it’s often the second failure that gets you into trouble, not just the first one in isolation. For example, if I had run out of gasoline, or if my engine had failed, the situation would have gone from just a broken mast to putting the entire boat and my personal safety at risk. The compounding nature of failures increases the importance of maintenance for every critical component.

I also learned that making good decisions under pressure is a skill that can be improved like any other. This was one of the few times in my life so far where I’ve had to make decisions in seconds with serious consequences. My inexperience in those situations led to a few sub-optimal choices. However, I’m proud of the fact that I successfully executed the correct series of actions under pressure after my initial mistake. I now recognize the importance of practicing making decisions in lower-stakes scenarios to better cope with higher-stakes situations when they occur.

Dan Puttick