The Default Waterfall

A 150-year-old blast-radius design your multi-tenant system lacks — and why a good one almost never flows.

Published June 2026 · 10 min read

On September 15, 2008, Lehman Brothers filed the largest bankruptcy in U.S. history. Sitting inside the London clearing house LCH was a Lehman position most people have never heard of: a portfolio of interest-rate swaps with a notional value of $9 trillion, spread across 66,390 individual trades, against which Lehman had posted about $2 billion of collateral. When a counterparty to a $9 trillion book vanishes overnight, the intuitive expectation is contagion: losses spraying across everyone Lehman traded with.

That is not what happened. Within three weeks, LCH's SwapClear had neutralized the macro risk with hedges, transferred the client positions to solvent members, and auctioned off the rest of the book across five competitive auctions among the surviving banks. The whole default was absorbed using 35% of Lehman's posted margin — Lehman's own money. No other clearing member lost a cent.

The reason isn't that LCH got lucky or improvised brilliantly under pressure. It's that the order in which losses would be absorbed had been decided, in writing, years before Lehman failed. There was a pre-committed sequence of who pays first, second, third — a structure clearing houses call the default waterfall. It is roughly 150 years old, it is mandated by law in two jurisdictions, and almost no multi-tenant software system has anything like it.

What a waterfall actually is

A central counterparty, or CCP, sits in the middle of a market. Through a legal move called novation, it becomes the buyer to every seller and the seller to every buyer. That concentrates a diffuse web of bilateral risk into one place, which raises the obvious question: what happens when one of its members defaults?

The answer is a fixed cascade of financial resources, drawn down in a strict, pre-agreed order:

1. The defaulter's own initial margin: collateral the failing member posted up front, sized to cover a worst-case one-day move (the 99th-to-99.7th percentile).
2. The defaulter's contribution to the mutualized default fund: their share of a shared safety pool.
3. The CCP's own capital: its "skin in the game." The clearing house puts its own equity on the line here.
4. The surviving members' mutualized fund: everyone else's contributions, tapped only now.
5. Assessment powers and recovery tools: the extreme last resorts, such as calling for more cash, tearing up contracts, or haircutting the gains of the winners.
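For readers who think in code, the ordering rule fits in a few lines. This is a minimal sketch, not any clearing house's actual rulebook: `run_waterfall` is a hypothetical helper, the layer names follow the list above, and the amounts are invented for illustration.

```python
from typing import List, Tuple

Layer = Tuple[str, float]  # (name, resources available in this layer)


def run_waterfall(layers: List[Layer], loss: float) -> Tuple[List[Tuple[str, float]], float]:
    """Draw a loss down through the layers in strict order.

    Returns (draws, uncovered): the amount taken from each layer, and
    whatever loss remains once every layer is exhausted.
    """
    draws = []
    remaining = loss
    for name, available in layers:
        # A layer is touched only after every earlier layer is exhausted.
        taken = min(available, remaining)
        draws.append((name, taken))
        remaining -= taken
    return draws, remaining


# Illustrative amounts only (arbitrary units), not real CCP figures.
LAYERS = [
    ("defaulter initial margin", 100.0),
    ("defaulter default-fund contribution", 20.0),
    ("CCP own capital", 10.0),
    ("surviving members' default fund", 200.0),
]

draws, uncovered = run_waterfall(LAYERS, loss=150.0)
for name, taken in draws:
    print(f"{name}: {taken}")
print("left for assessments and recovery tools:", uncovered)
```

Under these made-up numbers, a loss of 150 empties the first three layers and takes 20 from the surviving members' fund. Passing `loss=35.0` instead leaves every layer after the first untouched.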

The single most important word in that list is order. Each layer is exhausted before the next is touched, and the sequence is locked in the rulebook before anyone defaults. When Lehman went down, LCH did not convene a committee to decide who should eat the loss. The committee had met years earlier; the answer was already law. The default-management team's job was execution, not deliberation — and execution under a clear pre-commitment is fast, which is exactly why three weeks was enough.

There's a quiet lesson in the Lehman numbers that's easy to miss. The cascade was built to survive catastrophe, and the real catastrophe was comfortably absorbed at the very top layer, with 65% of the defaulter's margin to spare. The deeper layers — the mutualized fund, the assessments — never came into play. A good waterfall is one that almost never flows. The design exists so that the worst case is boring.

When the waterfall does flow

Ten years later, almost to the week, came the counter-example that proves the structure isn't just theoretical comfort. In September 2018, a single Norwegian power trader named Einar Aas had a large bet that Nordic and German electricity prices would converge. On September 10 the spread moved violently the wrong way, and Aas — clearing through Nasdaq's Swedish CCP — blew through his collateral. The loss beyond his posted margin came to €114 million.

This time the waterfall genuinely flowed. Aas's margin was gone, so the cascade reached Nasdaq Clearing's own skin-in-the-game first — about €7 million of the CCP's capital — and then poured into the €166 million mutualized default fund built from the other members' contributions. Roughly two-thirds of that shared fund was consumed, and the surviving members were required to replenish it. The mutualized layer, the one that almost never activates, activated.
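The figures quoted above can be checked with simple arithmetic. The sketch below uses only the rounded numbers in the text (the €7 million is "about"), and it assumes the whole remainder after the CCP's capital fell on the default fund, so treat it as a consistency check rather than an accounting of the actual default.

```python
# Figures as quoted in the text above, in millions of euros.
loss_beyond_margin = 114.0
ccp_skin_in_the_game = 7.0  # "about €7 million"
default_fund_size = 166.0

# Assumption: the CCP's capital goes first and the remainder falls
# on the mutualized default fund.
from_ccp = min(ccp_skin_in_the_game, loss_beyond_margin)
from_fund = min(default_fund_size, loss_beyond_margin - from_ccp)
fund_share_used = from_fund / default_fund_size

print(f"from CCP capital: {from_ccp:.0f}m")
print(f"from default fund: {from_fund:.0f}m ({fund_share_used:.0%} of the fund)")
```

That works out to €107 million from the fund, or about 64% of it, which matches the "roughly two-thirds" figure.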

Two defaults, ten years apart, are the whole argument in miniature. Lehman was colossal but well-margined, and the waterfall barely engaged. Aas was far smaller but badly under-margined, and the waterfall flowed past the defaulter's resources and the CCP's own capital into the mutualized fund. In both cases the order held: the defaulter's money first, the house's own capital before anyone else's, the shared pool only after that. The structure didn't guarantee zero loss; it guaranteed the loss landed on the right parties in the right sequence, no improvisation required.

Aas also carries a sharper warning. The default was resolved through a closed auction in which only four members were allowed to bid, and it was criticized afterward for locking in the €114 million loss at what may have been a discounted price: a fire sale with too few buyers. The order of absorption was correct; the mechanism for liquidating the failed position was not robust enough. Hold that thought, because it has a direct software analog.

The layer that makes it honest

Layer 3 — the CCP's own capital — looks like a small technicality wedged between the defaulter's resources and everyone else's. It is, in fact, the load-bearing piece of the entire design, and the reason is incentives rather than arithmetic.

Because the clearing house must burn its own equity before it reaches the surviving members' mutualized fund, it has skin in the game. It feels the first loss after the defaulter's resources are exhausted, which means it is powerfully motivated to police its members well — to set margins conservatively, to watch concentrated positions, to throw out the reckless. Strip that layer away, and the CCP's incentive inverts: if losses leapt straight from the defaulter to the mutualized pool, the house would bear none of the consequences of lax risk management. It could court volume, under-margin aggressively, and let the members absorb the damage. The skin-in-the-game tranche is what keeps the operator aligned with the people it's supposed to protect.

Regulators take this seriously enough to legislate the amount. Under the EU's EMIR framework, a CCP must contribute at least 25% of its minimum regulatory capital to the waterfall, positioned ahead of the surviving members' default-fund contributions. And the calibration is genuinely hard — the Federal Reserve Bank of Chicago calls it "the Goldilocks problem." Too little skin in the game and the CCP under-polices risk; too much and the members start to free-ride on the house's capital instead of managing their own exposure. Moral hazard flows in both directions. The Bank of England was still consulting in 2025 on whether to add a second skin-in-the-game tranche to sharpen the incentives further. A century and a half in, the layering is still being tuned — which tells you it's a live design problem, not a solved one.

Software measures the blast radius. It rarely orders it.

Now look at how a multi-tenant system handles the identical problem. We even have a name for the thing the waterfall contains: blast radius, the range of impact when a failure occurs, measured along two axes: how many users or transactions are hit, and how far the failure propagates across services.

The key word there is measured. We are good at measuring blast radius and we have a solid kit of tools for walling it off: circuit breakers that cut traffic to a failing dependency, per-tenant quotas that cap resource use, bulkhead isolation that gives each tenant its own pool, cellular architectures that confine a failure to one self-contained cell. Every one of these is a wall. None of them is a waterfall. They draw boundaries around the damage, but they do not specify a pre-committed order of who absorbs the cost first, second, and third when the damage exceeds the wall.

So when a noisy tenant blows past its quota, or a cell starts failing in a way the bulkhead didn't anticipate, what actually happens? Someone gets paged, and the loss order gets improvised mid-incident — which engineer is awake, which tenant is loudest on Twitter, which mitigation is quickest to ship. That is precisely the committee-at-the-moment-of-default that LCH had abolished by 2008. The improvisation is the failure mode.

The closest thing SRE already has to a default fund is the error budget — the allowance of unreliability a service is permitted to spend. But error budgets are almost always flat: one budget per service, drawn down by everyone together. They are a pool, not a cascade. There's rarely a committed sequence that says: first this misbehaving tenant burns its own budget, then it burns a buffer it reserved, then — and only then — the platform spends its own budget, and only after that does anyone else get degraded.

The mapping, layer for layer

The translation is almost suspiciously clean, because the underlying problem is the same one: one participant's failure must not bankrupt all the others.

1. Defaulter's initial margin → the tenant's own quota. The misbehaving tenant absorbs its failure out of its own allocation first. The traffic it can't handle, the latency it incurs, the retries it spawns: all capped at its own line.
2. Defaulter's default-fund contribution → the tenant's reserved buffer. A pre-funded slice each tenant contributes, drawn next, before anything shared is touched.
3. CCP's skin in the game → the platform's own error budget. This is the layer almost no one implements, and it is the load-bearing one. The platform operator should burn its own reliability budget (eat the cost, absorb the degradation in its own control plane) before any customer is degraded.
4. Surviving members' mutualized fund → the shared emergency pool. The communal capacity, tapped only after the platform has spent its own.
5. Assessment powers / variation-margin gains haircutting → proportional degradation of healthy tenants. The last resort, and a fairer one than it sounds: when everything else is exhausted, you trim the "winners" (the healthy, high-usage tenants) proportionally, rather than letting failures cascade randomly and take down whoever happens to be downstream. In clearing, this is literally called haircutting the gains of those with winning positions. In software, it's graceful, proportional shedding instead of a random brownout.
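The mapping above can be sketched as code. Everything here is hypothetical: the function names, the tenant names, and the capacities (in requests per second) are assumptions made for illustration, and a real platform would measure "excess load" in whatever unit its quotas use.

```python
from typing import Dict, List, Tuple


def absorb(excess: float, layers: List[Tuple[str, float]]) -> Tuple[Dict[str, float], float]:
    """Spend an offending tenant's excess load through ordered layers."""
    spent: Dict[str, float] = {}
    for name, capacity in layers:
        spent[name] = min(capacity, excess)
        excess -= spent[name]
    return spent, excess  # excess is now what the layers could not absorb


def shed_proportionally(residual: float, healthy_usage: Dict[str, float]) -> Dict[str, float]:
    """Last resort: trim healthy tenants in proportion to their usage."""
    total = sum(healthy_usage.values())
    if total == 0 or residual <= 0:
        return {tenant: 0.0 for tenant in healthy_usage}
    # Never shed more than the tenants are actually using.
    capped = min(residual, total)
    return {tenant: usage * capped / total for tenant, usage in healthy_usage.items()}


# Hypothetical capacities, in requests per second.
LAYERS = [
    ("tenant quota headroom", 50.0),
    ("tenant reserved buffer", 20.0),
    ("platform error budget", 30.0),
    ("shared emergency pool", 40.0),
]

spent, residual = absorb(200.0, LAYERS)
cuts = shed_proportionally(residual, {"acme": 300.0, "globex": 100.0, "initech": 200.0})
print(spent)
print(residual, cuts)
```

With these invented numbers, the four layers absorb 140 of the 200 excess, and the remaining 60 is spread across the healthy tenants as a 10% trim each. The point of the proportional rule is that the cut is predictable and the same for everyone, rather than falling on whoever sits downstream.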

The skin-in-the-game layer deserves the same emphasis here that it gets in finance, because it does the same job: it keeps the operator honest. If your platform degrades customers before it degrades itself, you are the CCP with no equity at risk — structurally incentivized to under-invest in reliability, because the costs land on someone else. If your platform spends its own budget first, you feel the first loss after the offending tenant, and suddenly you are motivated to prevent failures, isolate noisy neighbors, and size buffers honestly. Same Goldilocks tuning, too: make the platform's self-imposed buffer too small and you under-invest; too large and your tenants stop managing their own consumption because they assume you'll always catch them.

Two cautions finance learned the hard way

Borrowing the structure means borrowing its known failure modes, and clearing has catalogued them for you.

The first is concentration. The whole point of a CCP is to pull diffuse risk into one resilient place — but post-2008, regulators themselves acknowledged that this concentrates systemic risk: the absorber becomes the thing whose failure is unthinkable, and no live CCP failure has yet tested the resolution regime. The software lesson is to not let your shared emergency pool, or the platform control plane that administers the cascade, become a single point of failure that takes the whole system with it when it is the thing that breaks. The mechanism that contains blast radius must itself have a small blast radius.

The second is procyclicality. Clearing margins spike hardest exactly during a crisis — the UK pension crisis of 2022 and the March 2020 "dash for cash" both saw margin calls explode at the worst possible moment, draining liquidity precisely when it was scarcest. Your reserved buffers behave the same way: the capacity a tenant set aside gets consumed fastest during the very incident it was meant to survive. Size the layers for the correlated bad day, not the average one — and remember the Aas auction lesson, that the recovery mechanism (how you shed or liquidate the failing tenant's load) matters as much as the order of who pays. A correct cascade with a fire-sale recovery still crystallizes avoidable loss.
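To make "size for the correlated bad day" concrete, here is a small sketch that sizes a shared pool two ways from the same demand history. The history, the tenant names, and the single shared `quota` are all assumptions for illustration; real sizing would lean on stress scenarios, not only on past data.

```python
from typing import Dict, List


def excess_over_quota(demand: List[float], quota: float) -> List[float]:
    """Per-interval demand above a tenant's own quota (never negative)."""
    return [max(0.0, d - quota) for d in demand]


def pool_sizes(demand_by_tenant: Dict[str, List[float]], quota: float) -> Dict[str, float]:
    """Compare two ways of sizing a shared pool from the same history."""
    overflows = [excess_over_quota(d, quota) for d in demand_by_tenant.values()]
    # Total overflow in each interval, summed across tenants.
    combined = [sum(interval) for interval in zip(*overflows)]
    return {
        "average_interval": sum(combined) / len(combined),
        "worst_interval": max(combined),
    }


# Hypothetical demand history: every tenant spikes in the same interval,
# as they would during a shared upstream incident.
HISTORY = {
    "acme": [80.0, 90.0, 100.0, 260.0],
    "globex": [70.0, 100.0, 95.0, 240.0],
    "initech": [60.0, 85.0, 110.0, 250.0],
}

sizes = pool_sizes(HISTORY, quota=100.0)
print(sizes)
```

In this toy history the average interval needs a pool of 115, while the one interval where all three tenants spike together needs 450. A pool sized to the average is about a quarter of what the correlated interval demands.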

Decide the order before the member defaults

There's a reason the people who run trillion-dollar markets settle the loss order in advance and write it into a rulebook: under real stress, there is no time to be fair, and improvised fairness isn't fair anyway — it rewards whoever is loudest or luckiest. The waterfall converts a moment of panic into the execution of a plan made in calm.

For a multi-tenant system or an agent fleet, the move is concrete. Write down, before the next incident, the exact sequence: a misbehaving tenant or agent is absorbed first by its own quota, then by its reserved buffer, then — the layer to actually build — by the platform's own error budget, then by a shared pool, and only as a last resort by proportional degradation of the healthy. Commit it. Make the operator's own budget burn before any customer's. And size each layer for the correlated catastrophe, with a recovery mechanism that won't turn into a four-bidder fire sale.
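One way to "write it down" is to commit the order as data and have a check reject any version where customers pay before the operator. The sketch below is an assumption about how that might look, not an existing tool: `Layer`, `LOSS_ORDER`, and `validate` are hypothetical names, and the payer labels are invented.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)  # frozen: a layer cannot be edited mid-incident
class Layer:
    name: str
    payer: str  # "tenant", "platform", or "others"


# Hypothetical rulebook; the names mirror the sequence in the text.
LOSS_ORDER: Tuple[Layer, ...] = (
    Layer("tenant quota", "tenant"),
    Layer("tenant reserved buffer", "tenant"),
    Layer("platform error budget", "platform"),
    Layer("shared pool", "others"),
    Layer("proportional degradation", "others"),
)


def validate(order: Tuple[Layer, ...]) -> None:
    """Reject any rulebook where customers pay before the platform does."""
    payers = [layer.payer for layer in order]
    if "platform" not in payers:
        raise ValueError("no platform layer: the operator has no skin in the game")
    first_platform = payers.index("platform")
    if "others" in payers[:first_platform]:
        raise ValueError("other tenants are charged before the platform")
    if "tenant" in payers[first_platform:]:
        raise ValueError("the offending tenant's layers must come first")


validate(LOSS_ORDER)  # passes silently for the order above
```

Running a check like this in CI, against a file kept in version control, would mean the order can only change through a reviewed commit made in the calm, which is the software version of putting it in the rulebook.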

Then aim, as LCH did, to build it so well it never flows. The $9 trillion default that cost the bystanders nothing wasn't a triumph of heroics in the moment. It was a triumph of a decision made years earlier, when there was still time to make it carefully. Your next noisy-neighbor incident is going to force the same decision. The only question is whether you make it now, in the calm, or at 3 a.m. with the pager going off — which is to say, whether you have a waterfall, or just walls and good intentions.

The skin-in-the-game layer is an accountability commitment — and an agent fleet needs one too.

A default waterfall only works because the loss order is written down and the operator's own stake burns first, on the record, not improvised. For a fleet of agents acting on each other and on your systems, that pre-commitment needs an accountability substrate: verifiable identity for which agent acted, signed provenance for what it consumed and did, and portable reputation that makes a reckless member's history visible before it defaults. The Agent Trust Stack is where you write the rulebook down and prove the order held — so the cascade is execution, not a 3 a.m. committee.

pip install agent-trust-stack · npm install agent-trust-stack
