A Spectrum of Actions, Part I

J. Paul Reed
3 min read · Aug 2, 2023

Anyone who’s ever been involved with an operational incident knows action items are a likely outcome. Heck, they may be a post-incident demand.

What those action items materially are, how important they're perceived to be, how they get tracked, how they get resourced, and how they "morph" on their march to "completed" [sic] in the issue tracker… all of these become part and parcel of the story of a particular incident. Because of this, action items remain a topic of perennial debate among SRE and operations teams the world over.

I am often asked by teams and technology leaders how I think about post-incident action items. My answer? "It depends." (Because of course it is.) This two-part series offers some musings on my own calculus when answering that question.

One way to model action items is the good ol’ “consultant magic quadrant”: on the X axis, we have (a conceptual representation of) the “cost” of an action item; on the Y axis, we have (a conceptual representation of) its “benefit.”

It might look something like this:

[Figure: a 2×2 quadrant chart, with an action item's "cost" along the X axis and its "benefit" along the Y axis; yellow lines mark the organization's action item boundaries.]

Now, a few things stand out in this model:

  • In the upper left-hand quadrant, we've got Low Cost / High Benefit action items. These types of action items are the bread and butter of incident reviews (and "pre-incident reviews"!).
  • Below it, we've got the Low Cost / Low Benefit quadrant: these action items may be most aptly described as (lower-interest?) "technical debt," or maybe "hack-a-thon day" projects, in that they're comparably easy to accomplish; but beyond addressing "correctness" or some other aesthetic issue that keeps an engineer up at night, they generally won't get addressed as part of an incident remediation (and they seldom become part of "the incident story").
  • On the right side of the quadrant, we've got higher-cost action items (and increasingly so, the further right we go): these may take the form of large transition projects, such as rewriting a foundational library in a new language or moving to a different platform.
  • As we might expect, there are a couple of "action item boundaries" (which differ across organizations, of course, and even shift within the lifespan of a single organization). Those are in yellow above, and they represent the upper limit of what an organization will "pay" for a remediation item, and the lower limit below which an organization doesn't perceive it will get a return on its investment for that action item, so it doesn't seriously consider resourcing it.
    (If this feels a bit Rasmussen Triangle-esque, it should: those are its economic and workload boundaries, just… y'know, not in a triangle.) So given this, we'd expect action items in the lower right-hand quadrant to be fewer and further between, while those in the upper right-hand quadrant require… "more." (More on "more" later.) A minimal sketch of this model in code follows below.
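To make the model concrete, here's a minimal sketch in Python. Everything in it is an illustrative assumption on my part (the 0-to-10 scoring scales, the boundary thresholds, the example items); the point is just the mechanics of the quadrant and its boundaries.

```python
from dataclasses import dataclass

# Illustrative "action item boundaries"; every organization (and every
# era of the same organization) draws these lines differently.
MAX_COST_WE_WILL_PAY = 8.0       # upper limit: beyond this, it won't get funded
MIN_BENEFIT_WORTH_DOING = 2.0    # lower limit: below this, no perceived ROI

@dataclass
class ActionItem:
    name: str
    cost: float     # conceptual "cost," scored 0-10
    benefit: float  # conceptual "benefit," scored 0-10

def quadrant(item: ActionItem, midpoint: float = 5.0) -> str:
    """Place an action item in the classic 2x2 cost/benefit quadrant."""
    cost = "Low Cost" if item.cost < midpoint else "High Cost"
    benefit = "High Benefit" if item.benefit >= midpoint else "Low Benefit"
    return f"{cost} / {benefit}"

def worth_resourcing(item: ActionItem) -> bool:
    """Does the item fall inside both yellow boundaries?"""
    return (item.cost <= MAX_COST_WE_WILL_PAY
            and item.benefit >= MIN_BENEFIT_WORTH_DOING)

items = [
    ActionItem("Add retries/backoff to the flaky client", cost=2, benefit=8),
    ActionItem("Rewrite the foundational library", cost=9, benefit=7),
    ActionItem("Tidy up naming in the config module", cost=1, benefit=1),
]

for item in items:
    status = "inside" if worth_resourcing(item) else "outside"
    print(f"{item.name}: {quadrant(item)}, {status} the boundaries")
```

Run it and the first item lands squarely in bread-and-butter territory, the rewrite blows past the cost boundary, and the tidy-up falls below the benefit boundary: exactly the triage the whiteboard exercise promises.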

Now, this model is admittedly simplistic, but simplistic models often have the benefit of intuitiveness. In a retrospective, we could throw this quadrant up on a whiteboard, plot our action items, and have a nice, concrete, data-driven conversation about which items we are — and are not — going to do, based on which box they fall in and which side of the boundaries they’re on.

Right?

If you're laughing right about now, it's probably because you've tried to have exactly this type of conversation about an incident's proposed action items, and you've run into the weakness the model's simplicity reveals: namely, that "benefit" is so loosely defined as to be… useless in that discussion.

The concept of "benefit" is context-free, and while a specific incident's action items do frame "benefit" in a context, a substantive discussion of "benefit" requires more context than that, drawn from across the socio-technical system.

There are all sorts of "benefits" an action item (or its proponents, rather) might claim — architectural benefits, developer-productivity benefits, product-performance benefits, product-agility benefits, operational/cloud-cost benefits, direct customer benefits, revenue benefits, to name a few — so plotting "benefit" on a single axis is fairly limiting in terms of having a truly "data-driven" conversation about it.
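One way to see the problem: imagine actually trying to encode "benefit" for that plot. The sketch below is hypothetical (the field names and weights are mine, not any standard taxonomy); collapsing the dimensions onto one axis forces a weighting decision, and that weighting is exactly the context the single axis hides.

```python
from dataclasses import dataclass, fields

@dataclass
class Benefit:
    # One field per claimed "benefit" from the list above; names are
    # illustrative, not a standard taxonomy.
    architectural: float = 0.0
    developer_productivity: float = 0.0
    product_performance: float = 0.0
    product_agility: float = 0.0
    operational_cost: float = 0.0   # e.g., cloud-spend savings
    direct_customer: float = 0.0
    revenue: float = 0.0

def one_axis(b: Benefit, weights: dict[str, float]) -> float:
    """Collapse the dimensions to a single plottable number."""
    return sum(weights.get(f.name, 0.0) * getattr(b, f.name) for f in fields(b))

# Same action item, same per-dimension scores...
b = Benefit(operational_cost=9.0, revenue=1.0)

# ...but different stakeholders weight the dimensions differently:
print(one_axis(b, {"operational_cost": 1.0}))  # ops lens: 9.0
print(one_axis(b, {"revenue": 1.0}))           # sales lens: 1.0
```

Same scores, two different dots on the chart; the "data-driven" conversation is really a debate about the weights.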

There is another way to model action items, though: one that may give us more insight into how our organization actually thinks about them. We'll dive into that in part 2!


J. Paul Reed

Resilience Engineering, human factors, software delivery, and incident insights; Principal at Spective Coherence: What Will We Discover Together?