We have built some beautiful toolchains that crank out a finished product on the fly, without anything close to the level of intervention that was historically required. The most advanced organizations on an automation journey could change a line of code and then wait for the new version to hit production without doing a thing. I say “could” because I don’t know of a single org that does this; there are, of course, checks in place, and it’s projects that release, not single-line changes. But the infrastructure that would allow them to do it is there, and there’s a high probability that one day we all will.
The problem is that most organizations only look at risk when building these toolchains and don’t bother with contingency plans. But over the long life of any given toolchain (and toolchains are made up of many moving parts), a contingency plan will inevitably be needed.
And it’s certainly not only toolchains. Remember when popular libraries were purposely corrupted by their own author to make a point? And unless you’ve done your best to ignore it, you’re also aware of faked versions of popular libraries that insert malicious code before passing execution on to the original…
Most organizations do not consider a scenario in which critical libraries are suddenly unavailable, or one in which a tool is crippled by movement in the market (a vendor making harsh licensing changes, the underlying tech changing in ways that break the tool, etc.). While it is comforting to think that open source protects us from many of these changes (and the ability to fork does help), that protection is not complete; it does not entirely cover our systems and toolchains. We need to know what is absolutely critical and how the organization will keep delivering if that absolutely critical element goes away.
When I have done disaster planning (of which this is an adjunct), I have kept it deliberately general. We have plenty of work to do, so all I ever wanted was a list of the critical parts and an idea of how we would adapt without them. I worked at a place that kept a backup data center, intentionally located on the other side of the country, in case of total regional disaster, but most of us don’t need that kind of DR. Certainly not if we can go, “This OSS library is used in 60% of our apps; if we have to, we could adapt to this other one,” instead of building infrastructure.
For OSS dependencies, we have two great options that work together: generate SBOMs and correlate library usage across the entire app portfolio, then set up and maintain a local repository with point-in-time copies of the libraries deemed critical.
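As a rough illustration of the correlation step, here is a minimal sketch that walks a directory of CycloneDX JSON SBOMs (assuming one SBOM file per application) and reports which components appear in more than a given share of the portfolio. The directory name and the 50% “critical” threshold are placeholder assumptions, and a real run may need smarter de-duplication (for example, by package URL rather than name and version).

```python
#!/usr/bin/env python3
"""Correlate OSS dependency usage across a portfolio of CycloneDX JSON SBOMs.

Assumes one SBOM file per application in a single directory; the directory
name and the 50% threshold below are illustrative, not prescriptive.
"""
import json
import sys
from collections import defaultdict
from pathlib import Path


def load_components(sbom_path: Path) -> set[str]:
    """Return the set of 'name@version' components listed in one SBOM."""
    data = json.loads(sbom_path.read_text())
    return {
        f"{c.get('name', '?')}@{c.get('version', '?')}"
        for c in data.get("components", [])
    }


def main(sbom_dir: str, threshold: float = 0.5) -> None:
    sbom_files = sorted(Path(sbom_dir).glob("*.json"))
    if not sbom_files:
        sys.exit(f"No SBOM files found in {sbom_dir}")

    # component -> set of applications (SBOM file stems) that use it
    usage: dict[str, set[str]] = defaultdict(set)
    for sbom in sbom_files:
        for component in load_components(sbom):
            usage[component].add(sbom.stem)

    total_apps = len(sbom_files)
    print(f"{total_apps} applications scanned\n")
    print("Components widely used enough to be candidates for local mirroring:")
    for component, apps in sorted(usage.items(), key=lambda kv: -len(kv[1])):
        share = len(apps) / total_apps
        if share >= threshold:
            print(f"  {component}: {len(apps)}/{total_apps} apps ({share:.0%})")


if __name__ == "__main__":
    main(sys.argv[1] if len(sys.argv) > 1 else "sboms")
```

The list that falls out of something like this is what feeds the second step: whatever local mirror or pull-through cache your package ecosystem supports, so that a point-in-time copy of each critical library lives on infrastructure you control.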
A similar answer for tooling is enough, too: “Our CI tool is critical, but we could port our processes and YAML to this other one.” It is absolutely not going to be easy if you ever have to do it, but “We looked at this six months ago and we know what to do” is far better than “Oh crap, now what?”
You built it, you operate it, you are rocking it. Just take some time to protect it and you are covered.