A Spectrum of Actions, Part II

J. Paul Reed
6 min read · Aug 7, 2023

In part 1 of this 3-part series, we discussed a simple yet intuitive model for better understanding the spectrum of possible post-incident action items. But, like many simple models, it has its weaknesses.

Can we improve it?

Let’s start by reexamining this concept of “benefit.”

If we accept Sidney Dekker’s assertion in Just Culture that “No professional comes to work to do a bad job,” then we should assume an engineer working on an action item, of any size and effort, has a justification for doing so. Under this rubric, we would believe them when they say “In my professional opinion, this is beneficial enough to work on.”

So let’s just… remove that Y-axis.

Second, let’s tweak the X-axis so that instead of “cost,” it models “the likelihood the action item will be completed.” That last modifier is important: no points for unfinished action items! (Or for gaming the definition of “done,” agile-style.)

Our “consultant magic quadrant” now looks like this:

Having removed the “benefit” axis, this looks more like a… spectrum. There are still some carryovers, though: despite being flattened a bit, our yellow “action item boundary” from before is still there.

On the left end, you’ve got action items that the organization will absolutely do: they are so low cost and such high “benefit,” there’s no reason they wouldn’t get done. Heck, not doing them would be tantamount to professional misconduct!

One good way to spot action items on this end of the spectrum: they’re already completed by the time the team holds its incident retrospective. Another common indicator: the action item is already in the next planned sprint (and an ambitious engineer may have already started working on it!).

On the right end are action items that won’t ever get done. A real-world example: an organization had a multi-day, multi-hundred-million-dollar outage. It made the news. The “root cause” [sic] was eventually identified, and it was related to subtle differences in server provisioning.

See, this data center had originally been provisioned over a decade ago, and back then, the provisioning practices weren’t as automated (or even documented) as they are now. The company was in the midst of a multi-year project to migrate clients out and completely decommission that data center. They knew, from other incidents, that there were all sorts of gremlins running around that infrastructure. This massive, “in-the-news” incident just beat them to the customer-migration finish line.

An action item was suggested to go back and fix the provisioning issue that “caused” the incident. Even though that remediation task was pretty minor — tweaking some low-level hardware settings on each physical server and rebooting — and could have been done quickly enough, there was intense debate on whether it should be done. The incident pushed customers out of that data center faster (a benefit!), the organization was now aware of the form and presence of (yet another) devilish gremlin they had to work around during the migrations, and that gremlin would perish in a few months anyway.

Thus was born… an action item “the company will never, ever do.” It would never get funded, despite the fact that the action item itself was a reasonable remediation, that its cost, while large, was absolutely dwarfed by the cost of the incident, and that it would “100% solve the root cause [sic] of that last massive incident!” In a sense, they had been “remediating” this issue for months already; so why would they change the remediation tactic they were already using?

(It’s worth noting: the right end of the spectrum also covers the more… “exotic” action items, like “This was caused by a Windows issue, so let’s migrate our entire self-run physical infrastructure to Linux and rewrite the entire code base.” Sure, Jan… we could do that. But we won’t do that either. Ever.)

The remaining part of the spectrum is, for our purposes, the most fascinating. It’s “the stuff that isn’t so ‘obvious’ or ‘cheap’ that it’s already done by the retro.” But it’s also the action items we know won’t immediately get shot down, nor be considered so “absurd” that no one even bothers voicing them.

This middle is where the nuanced, deep, sometimes heated conversations about action item appropriateness take place. The discussion and debate here are exemplars of engineers expressing not only their expertise, but actively engaging in the push and pull of organizational power dynamics. This is discretion enacted, which is why we could call it “The Discretionary Space,” another homage to the Rasmussen Triangle.

This process of action-item negotiation is one of the greatest sources of organizational learning we can glean from incidents: when repeatedly observed, we start to better understand where that yellow boundary is, what the limits really are on what an organization will pay for “the fix,” and how the organization itself weighs its participants’ concept of “beneficial” and then executes on that weighted understanding.

In practice (and I’ve done this exercise with teams), this spectrum can provide useful insight. Say, after an incident, we ask engineers to plot their personal estimate of the probability that the action items generated in (or outside of) the retro will get completed. Then, we put that in the drawer.

If we revisit that spectrum later — usually a few sprints, say 4–6 weeks — we can roughly see how accurate our incident response team’s probability estimates turned out to be. This creates a feedback loop for engineers about the accuracy of their own perceptions of various parts of the socio-technical system in which they work. It’s not about whether or not the probabilities were correct; either way, it’s all useful data to base learning on!
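To make that feedback loop concrete, here’s a minimal sketch in Python of what the “revisit the drawer” step might look like. The data, field names, and the Brier-score calibration check are all my own illustration of the idea, not a tool the teams I’ve worked with actually used:

```python
# Hypothetical sketch: compare the retro group's predicted completion
# probabilities against what actually happened a few sprints later.
from dataclasses import dataclass


@dataclass
class ActionItem:
    title: str
    predicted_probability: float  # group's estimate at retro time (0.0-1.0)
    completed: bool               # observed status ~4-6 weeks later


def brier_score(items: list[ActionItem]) -> float:
    """Mean squared gap between prediction and outcome (lower = better calibrated)."""
    return sum(
        (item.predicted_probability - float(item.completed)) ** 2
        for item in items
    ) / len(items)


# Illustrative data only: estimates made at the retro vs. reality later.
items = [
    ActionItem("Add alerting for queue depth", 0.95, True),
    ActionItem("Automate provisioning audit", 0.40, False),
    ActionItem("Decommission legacy data center", 0.05, False),
]

print(f"Calibration (Brier score): {brier_score(items):.3f}")
for item in items:
    gap = item.predicted_probability - float(item.completed)
    print(f"  {item.title}: predicted {item.predicted_probability:.0%}, "
          f"completed={item.completed}, gap={gap:+.2f}")
```

Whether you compute a formal calibration score or just eyeball the gaps, the point is the same: the comparison itself is the learning artifact.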

The Action League: determining the action items since 1995!

For other leaders looking at this data, it provides insight into risk, the organization’s appetite for risk, and its ability to react to it. It paints a picture of both the post-incident modalities and limitations of our organization. In a generative organizational culture, this can be used as input into future action item discussions during future retrospectives on future incidents.

For example, an engineer may propose a different remediation item than they would have previously, because they’ve gained a clearer sense of the organization’s tolerance for various types of work, including “fix it” work; managers may notice a “graveyard” on the spectrum where certain types of action items go to die, and prepare themselves to advocate differently for those action items they know really are important. (Or, the spectrum may help provide data for leaders to show such a graveyard even exists!)

But it doesn’t stop there: what if we start plotting action items from different incidents on the same spectrum? As we do so, patterns begin to emerge about how the organization thinks about various areas of its socio-technical system.

Remember that on this spectrum, the group is plotting their own perceived probability a specific action item will actually get completed; so if we see a set of action items assigned to one functional area that consistently get placed by the group on the right end of the spectrum, this can indicate that functional area is “under water” and needs help bailing out. In a technology context, if all the database remediation items are consistently on the right end of the spectrum, that can represent a signal of increasing risk in specific areas, or even dark debt that continues to go unaddressed.
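As a rough sketch of how that signal might be surfaced, here’s how plotted items from several incidents could be grouped by functional area and averaged; the area names, probabilities, and “under water” threshold below are all hypothetical, purely to illustrate the pattern:

```python
# Hypothetical sketch: group action items from multiple incidents by
# functional area and flag areas whose items consistently land on the
# right end of the spectrum (low perceived probability of completion).
from collections import defaultdict

# (functional_area, predicted_completion_probability) gathered across retros
plotted_items = [
    ("database", 0.10), ("database", 0.15), ("database", 0.20),
    ("frontend", 0.80), ("frontend", 0.60),
    ("provisioning", 0.05), ("provisioning", 0.30),
]

by_area: dict[str, list[float]] = defaultdict(list)
for area, probability in plotted_items:
    by_area[area].append(probability)

UNDER_WATER_THRESHOLD = 0.25  # arbitrary cutoff, for illustration only

for area, probabilities in sorted(by_area.items()):
    mean = sum(probabilities) / len(probabilities)
    flag = "  <- possible graveyard / unaddressed risk" if mean < UNDER_WATER_THRESHOLD else ""
    print(f"{area}: mean predicted completion {mean:.0%} "
          f"across {len(probabilities)} items{flag}")
```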

Now, you might be asking “What the heck is that blue line on the spectrum?”

The answer is: I thought this would all fit nicely in two posts… but it didn’t. So I’ll talk about that weird blue line in the third and final part of the series!


J. Paul Reed

Resilience Engineering, human factors, software delivery, and incident insights; Principal at Spective Coherence: What Will We Discover Together?