Your risk is correlated by stack, not independent. The most dangerous assumption in your architecture is the one you never wrote down.
On the morning of July 19, 2024, the world's computers turned blue at the same time. Not metaphorically but literally: millions of Windows Blue Screens of Death, all at once. Flights stopped; Delta alone cancelled thousands over the following days. Hospitals reverted to paper. Banks, broadcasters, retailers, and stock exchanges went dark together. By the time the dust settled, Microsoft estimated that about 8.5 million Windows machines had crashed, and the cyber-risk modeler Parametrix put the direct losses to Fortune 500 companies (Microsoft excluded) at roughly $5.4 billion.
It was not an attack. Nobody got hacked. A security company called CrowdStrike had pushed a routine configuration update to its Falcon Sensor, the very software organizations install to protect themselves, and a bug in that update crashed every Windows machine that loaded it, near-simultaneously, across every continent. The thing everyone had bought in order to be safe was the thing that took them all down at once.
Read that twice, because it is the whole essay: the shared defense was the single point of failure. And if you build software, you have almost certainly built the same trap into your own system, and your reliability math is hiding it from you.
Cyber insurers have a name for what happened on July 19, and an entire architecture built around dreading it. They call it aggregation risk, and it is the structural nightmare that makes cyber a fundamentally different business from every other kind of insurance.
Consider how property insurance survives a catastrophe. A hurricane is enormous, but it is bounded. It hits a coastline, a few states, a known footprint. The insurer who wrote too many policies in Miami balances them against policies in Denver and Seattle, which that storm will never reach. Property catastrophe correlates by geography, and geography is diversifiable: spread your book across the map and no single event can touch all of it at once.
Cyber refuses to cooperate. A cyber catastrophe correlates by technology stack, not geography, and a stack has no map. A zero-day in Microsoft Exchange, a supply-chain compromise of SolarWinds, a bad update from CrowdStrike, a critical bug in Log4j: each hits thousands of insureds simultaneously, across every industry and every continent, for one reason: they all run the same library, the same vendor, the same cloud. There is no Denver to balance against Miami when the exposure is “everyone on Earth using the most popular logging library in Java.” As the underwriters put it, with the flat fatalism of people who have paid these claims: the loss that kills the market is the one everyone shares.
Here is where it stops being an insurance problem and becomes yours.
Every engineer learns to reason about reliability roughly like this. If a component is up 99.9% of the time (“three nines”), then a system of three such components, any one of which can carry the load, is up far more than that. The probability that all three are down at once is 0.001 × 0.001 × 0.001: one in a billion. Three nines, composed, become nine nines. Bulletproof.
That multiplication is one of the most comforting calculations in our field, and it is true, with an asterisk large enough to swallow the entire result. P(all fail) = P₁ × P₂ × P₃ is valid only if the failures are independent. That is not a side condition; the product rule is the definition of independence. The moment the three components share a cause (the same cloud region, the same authentication provider, the same base image, the same npm package) the formula quietly falls apart. The probability that all three go down is no longer the product of three tiny numbers. It is approximately the probability that the one shared thing fails: P(all fail) ≈ P(shared dependency fails), a number that can be orders of magnitude larger.
Your nine nines was never real. Your true availability is roughly the availability of the single most important thing your three replicas have in common, and on a bad quarter, that might be two nines.
A clarification worth making, because it keeps you honest: correlation isn't binary. Real dependencies sit somewhere on a spectrum between perfectly independent and perfectly shared. The mistake is rarely “everything fails together”; it's that you assumed independence (ρ = 0, the textbook's tidy product) when the real correlation was something uncomfortably above zero, and for a hard shared dependency, close to one.
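One rough way to explore that spectrum is a simplified version of the beta-factor model used in common-cause failure analysis. In this sketch, `beta` is the assumed fraction of each component's failure probability that comes from a cause shared by all of them. It plays a role loosely similar to ρ but is not the same quantity, and the values swept below are illustrative, not calibrated.

```python
def p_all_fail_beta(p: float, n: int, beta: float) -> float:
    """Simplified beta-factor estimate of P(all n components fail).

    p    -- failure probability of a single component
    n    -- number of redundant components
    beta -- fraction of p attributed to a common cause (0 = fully
            independent, 1 = fully shared); an assumed input
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must be between 0 and 1")
    p_common = beta * p        # one event that takes out every component
    p_indep = (1 - beta) * p   # per-component independent failures
    # Everything is down if the common cause fires, or if it does not
    # and all n components still fail on their own.
    return p_common + (1 - p_common) * p_indep ** n


for beta in (0.0, 0.01, 0.1, 0.5, 1.0):
    print(f"beta={beta:<4}  P(all fail) = {p_all_fail_beta(0.001, 3, beta):.3e}")
```

Even a small shared fraction dominates the result: at `beta=0.01` the common-cause term alone is 1e-5, far above the one-in-a-billion figure that pure independence would give.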
This trap has a name, borrowed from the engineers who think hardest about catastrophe, the ones who design nuclear reactors and flight controls. They call it common-mode failure: the failure of redundant systems through a single shared cause. You can bolt three independent flight computers onto an aircraft, but if all three run the same software with the same bug, you don't have triple redundancy; you have one bug with three votes. The Boeing 737 MAX's MCAS read from a single angle-of-attack sensor; the redundancy that would have mattered wasn't there, and two aircraft were lost with everyone aboard.
Your three replicas are those three flight computers. Spinning up a second and third copy of a service protects you beautifully against independent failure: a disk dies, a machine reboots, a process leaks memory and gets recycled. It does precisely nothing against a shared one. Three replicas that all run the same compromised dependency, or all live in AWS us-east-1, or all bake in the same base image at build time, are not a redundant system. They are a single point of failure that you are paying to run three copies of. When us-east-1 has one of its periodic bad days, and it has taken out vast swaths of the internet more than once, your three replicas fail in perfect synchrony, because they were never three risks. They were always one, wearing three names.
If this is starting to sound less like an engineering problem and more like a finance one, that's because it is the same problem, not by analogy, but by identity.
Open any portfolio theory textbook and you'll meet the distinction at the core of the work that earned Harry Markowitz and William Sharpe their shares of the 1990 Nobel Prize in economics: idiosyncratic risk versus systematic risk. Idiosyncratic risk is specific to one holding (this CEO quits, that factory burns) and it is diversifiable: hold enough uncorrelated assets and the independent shocks average out, good canceling bad toward the mean. Systematic risk is shared across everything you own (the whole market falls) and it is not diversifiable, because it strikes every holding at once. The mathematics is merciless: the variance of a sum of correlated risks is dominated by the covariance terms, and as the correlation ρ approaches 1, adding more holdings stops helping at all. A portfolio of one stock is not made safe by buying a lot of it.
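The variance claim can be checked with the standard textbook formula for an equally weighted average of n risks that share one standard deviation σ and one pairwise correlation ρ: Var = σ²(1/n + (1 − 1/n)ρ). The Python sketch below assumes that equal-weight, equal-correlation setup with ρ between 0 and 1; real portfolios (and real dependency graphs) are messier.

```python
def portfolio_variance(n: int, sigma: float, rho: float) -> float:
    """Variance of an equally weighted average of n risks, each with
    standard deviation sigma and the same pairwise correlation rho."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # sigma^2 / n is the diversifiable part; rho * sigma^2 * (1 - 1/n)
    # is the covariance part that more holdings cannot remove.
    return sigma ** 2 * (1 / n + (1 - 1 / n) * rho)


for rho in (0.0, 0.3, 1.0):
    row = ", ".join(f"n={n}: {portfolio_variance(n, 1.0, rho):.3f}" for n in (1, 3, 10, 100))
    print(f"rho={rho}: {row}")
```

At ρ = 0 the variance shrinks as 1/n. At any ρ above zero it flattens out at ρσ² no matter how many holdings are added, and at ρ = 1 it never moves at all.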
That last sentence is your three replicas, word for word. And the identity isn't hand-waving; it's in the literature. A paper in MIS Quarterly, “Correlated Failures, Diversification, and Information Security Risk Management,” maps the diversification theorem directly onto correlated IT failures. The cyber-insurance pricing literature (for instance the 2022 paper “Modeling and Pricing Cyber Insurance: Idiosyncratic, Systematic, and Systemic Risks”) splits cyber exposure into buckets any portfolio manager would recognize. The underwriter pricing your policy, the SRE choosing a replica count, and the quant building a portfolio are all solving the same equation. They simply don't know they're colleagues.
And the equation carries a punchline that every financial crisis re-teaches: in a crisis, correlations go to 1. Assets that looked independent through calm years (that handed you a long, comforting run of diversification) all crashed together in 2008, because the event large enough to cause a true catastrophe is, by definition, the thing they had in common. Diversification is a fair-weather friend. It works right up until the correlated tail event arrives, which is the precise moment you needed it. The regional outage, the cyber-cat, the shared-dependency bug: these are your correlations-go-to-one moments, when your redundancy resigns just as the storm everyone shares rolls in.
So what do you actually do? Here the cyber underwriter, who has been losing sleep over this for longer than you have, hands you a playbook with three plays.
Model your aggregation explicitly. The underwriter's central question, the one their whole craft is organized around, is: what single dependency, if it failed, would take out the largest fraction of the book? They buy intelligence on Single Points of Failure (the named ones recur: SolarWinds, Microsoft Exchange, Okta, Log4j, CrowdStrike, Change Healthcare, whose 2024 breach alone ran past a billion dollars), and they run accumulation models from firms like CyberCube, Kovrr, and DeNexus to estimate their Probable Maximum Loss: the worst plausible correlated event. You can run that same scan on your own system, and almost nobody does. Draw your real dependency graph, not the org chart of your services, but the shared substrate beneath them: which region, which auth provider, which base image, which package sits under the maximum number of your critical paths? That dependency's availability is your system's availability. Compute your stack's PML. The number is usually sobering and always clarifying.
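A minimal Python sketch of that scan. The service names, dependency labels, and the `CRITICAL_PATHS` table are all hypothetical; in practice the table would come from your own inventory, and it would need to include transitive dependencies, which this sketch does not resolve.

```python
from collections import Counter

# Hypothetical inventory: each critical path mapped to the shared
# substrate it relies on (region, auth provider, base image, package).
CRITICAL_PATHS = {
    "checkout": {"aws:us-east-1", "auth:provider-a", "image:base-v1", "pkg:logging-lib"},
    "login":    {"aws:us-east-1", "auth:provider-a", "image:base-v1"},
    "search":   {"aws:us-west-2", "image:base-v1", "pkg:logging-lib"},
    "billing":  {"aws:us-east-1", "auth:provider-a", "image:base-v1"},
}


def aggregation_scan(paths):
    """Rank dependencies by the fraction of critical paths they sit under.

    Returns a list of (dependency, fraction) pairs, largest fraction first,
    with ties broken alphabetically.
    """
    counts = Counter(dep for deps in paths.values() for dep in deps)
    total = len(paths)
    ranked = [(dep, n / total) for dep, n in counts.items()]
    return sorted(ranked, key=lambda item: (-item[1], item[0]))


for dep, fraction in aggregation_scan(CRITICAL_PATHS):
    print(f"{dep:<18} under {fraction:.0%} of critical paths")
```

In this made-up inventory the base image sits under every critical path, so it, not any individual service, is the first candidate for the single-point-of-failure list. The top fraction is only a crude stand-in for a PML: it counts paths, not the cost of losing them.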
Set sublimits on shared exposure. An insurer that finds a dangerous concentration doesn't always refuse it; often it caps it, writing a sublimit: we will carry only so much exposure to this one vendor, this one technology. The engineering translation is a concentration budget: no single shared dependency should sit beneath more than some chosen fraction of your critical paths. If your authentication, your data plane, and your control plane all funnel through one provider, you haven't built three systems; you've built one. The sublimit discipline says break that concentration before it breaks you.
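A concentration budget can be expressed as a simple check, sketched here in Python. The 50% budget, the provider names, and the `concentration_violations` helper are illustrative assumptions; the right threshold is a judgment call for each system.

```python
from collections import Counter


def concentration_violations(paths: dict[str, set[str]], budget: float) -> dict[str, float]:
    """Return every dependency that sits under more than `budget`
    (a fraction between 0 and 1) of the given critical paths."""
    if not paths:
        return {}
    counts = Counter(dep for deps in paths.values() for dep in deps)
    total = len(paths)
    return {
        dep: n / total
        for dep, n in sorted(counts.items())
        if n / total > budget
    }


# Hypothetical system: three "separate" planes that all funnel
# through provider-a.
paths = {
    "auth":          {"provider-a"},
    "data-plane":    {"provider-a", "provider-b"},
    "control-plane": {"provider-a"},
}

violations = concentration_violations(paths, budget=0.5)
for dep, share in violations.items():
    print(f"over budget: {dep} sits under {share:.0%} of critical paths")
```

A check like this could run in a design review or a CI job, so that a new dependency which pushes concentration past the budget is at least a conscious decision rather than an accident.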
Treat the monoculture as the systemic risk it is. The root cause underwriters name is platformization: the over-dependence of the whole economy on a handful of clouds, SaaS vendors, and open-source projects. The engineering version is the hard recognition that adding replicas to a monoculture only adds copies that share its fate. Real resilience for your critical core demands independent redundancy (a genuinely different region, provider, or implementation) not another instance of the same thing.
The underwriters will also tell you the opposite caution, because they've seen people overreact. Two honest caveats keep the prescription proportional.
The first is cost, and it bites. Independent redundancy is expensive: genuinely multi-cloud, multi-provider infrastructure multiplies your operational complexity, and the failover machinery you add to achieve it can itself become a new shared dependency that fails at the worst possible moment. You cannot afford to de-correlate everything, and you shouldn't try. That is exactly why the aggregation scan comes first: it tells you which shared exposures are worth the price of true independence (the few sitting under the most critical paths) so you spend a finite resilience budget on the genuine catastrophe risks and consciously accept the rest. Some correlated risk you simply can't engineer away: NotPetya in 2017 was so correlated, and so vast, that Merck's insurers contested a roughly $1.4 billion claim by invoking the policy's act-of-war exclusion. For risks like that, the honest move is to reduce concentration where you can and consciously accept what remains, not to pretend a third replica covers it.
The second caveat is subtler, and a little haunting: diversification can manufacture the monoculture. When every engineering team on the planet independently does the smart, locally rational thing (pick the best, most reliable cloud) the system ends up concentrated on three providers that no one chose collectively. Each team feels diversified; the internet becomes a monoculture. This is precisely the exposure financial regulators have begun writing rules about, creating oversight regimes for “critical third parties” such as the largest cloud providers (the EU's DORA regulation and the UK regime overseen by the Bank of England both do this), because the systemic risk is real even though no single actor created it on purpose.
CrowdStrike's blue screens have mostly faded from the news, but the lesson under them hasn't moved an inch. The most dangerous assumption in your architecture is the one you never wrote down: that your failures are independent. It hides inside the very math that makes you feel safe (the multiplied nines, the replica counts, the “highly available” in the design doc) and it stays hidden right up until the shared dependency fails and reveals that all your redundancy was correlated the entire time.
So ask the underwriter's question, the one your uptime dashboard will never volunteer: what single dependency, if it failed right now, would take out the largest fraction of my system? Run that scan. Whatever it returns is your real availability, not the comfortable product of independent nines, but the bare number of the one thing everyone shares. Independence is the assumption that quietly turns your reliability math into fiction. And the bill for that fiction never arrives a little at a time. It comes due all at once.
Three checks that share a blind spot are one check with three votes.
The essay's hardest lesson is that redundancy isn't independence: three replicas on the same shared substrate fail together. The same trap waits in how a fleet verifies itself. If one agent checks another's work and both rest on the same model, the same self-report, the same single assumption, your “independent” safeguards are correlated, and a single shared failure takes the whole verification down at once. The Agent Trust Stack is built as genuinely independent layers (a tamper-evident provenance record, learned reputation, and fast checks that don't all derive from the same source) so your trust in an agent doesn't rest on one dependency that, on a bad day, fails for all of them simultaneously.
pip install agent-trust-stack · npm install agent-trust-stack