Stop modeling traffic as one pool. Two services with identical total QPS can have opposite fragility, and the total number hides which one you have.
On Thursday, March 9, 2023, customers of Silicon Valley Bank tried to withdraw about $42 billion in a single day. By the following morning the bank was gone, seized by regulators in what was then the second-largest bank failure in U.S. history. The collapse left an estimated $16.1 billion hole in the federal Deposit Insurance Fund, according to the Federal Reserve's own Office of Inspector General review published that September. The post-mortems fixated, correctly, on duration risk: SVB had parked a mountain of deposits in long-dated bonds and watched their market value crater as the Fed raised rates. That was the kindling.
But kindling doesn't explain the speed. Banks have mismanaged interest-rate risk for as long as there have been banks, and most of them die slowly, over quarters, in a managed wind-down. SVB died between lunch and close. The reason that run was fast enough to outrun the regulators was sitting in plain sight on the other side of the balance sheet. Not in the assets, but in the deposits themselves.
More than 90% of SVB's deposits were uninsured, the highest share of any major U.S. bank. And they belonged to a single, tightly wound tribe: venture-backed startups, who all banked at the same place, shared the same investors, read the same Twitter feed, and could move millions with a thumb on a phone. When a few prominent VCs told their portfolio companies to pull out, the portfolio companies all moved at once, in the same direction, instantly. SVB's total deposits had looked fine for years. Its composition was a loaded gun: concentrated, uninsured, correlated, and digitally mobile.
Here is the part that should interest anyone who plans capacity for a living. Bank regulators saw this exact failure mode coming more than a decade ago, and they responded by outlawing the way most of us model load.
After 2008, the Basel Committee built a stress test called the Liquidity Coverage Ratio, or LCR. The mechanic is simple to state: a bank must hold enough High-Quality Liquid Assets to survive 30 days of a defined panic, with liquid assets divided by stressed net outflows staying at or above 100%. The genius is not in the ratio. It's in how the denominator gets computed.
The LCR refuses to treat "deposits" as one number. Instead, every category of funding is assigned its own run-off rate (the fraction regulators assume will flee in those 30 days) based purely on how flighty that kind of money has proven to be. The Basel framework's published outflow schedule (chapter LCR40 of the Basel Framework, on the Bank for International Settlements site) lays out a structure that, in its standardized form, looks roughly like this:
Stare at the ends of that table for a second. The same dollar of deposit is assumed to be anywhere from 3% to 100% likely to flee, a thirty-three-fold spread, based on nothing about the dollar itself, only on who owns it and why they parked it there. A 2015 analysis from UC Berkeley's law school put the whole philosophy in its title: "All Cash Is Not Created Equal."
The consequence is the load-bearing idea of this entire essay. Two banks with identical total deposits can have wildly different fragility. A bank that is 80% stable insured retail needs a small liquid buffer to pass the test. A bank that is 80% uninsured financial wholesale needs an enormous one, or it fails. The danger lives in the composition, and the total deposit figure hides it completely.
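To make that arithmetic concrete, here is a minimal sketch. It assumes only the two run-off rates at the ends of the spread quoted above (3% and 100%), two hypothetical banks with the same total, and a simplified ratio that ignores inflows. The category names and the 80/20 splits are illustrative, not taken from the Basel text.

```python
# Illustrative run-off rates: the two ends of the spread quoted above.
# Category names and balances are assumptions made for this sketch.
RUN_OFF = {
    "stable_insured_retail": 0.03,
    "uninsured_financial_wholesale": 1.00,
}

def stressed_outflow(deposits):
    """Sum of balance * run-off rate over every funding category."""
    return sum(balance * RUN_OFF[kind] for kind, balance in deposits.items())

def lcr(hqla, deposits):
    """Liquid assets divided by 30-day stressed outflows (inflows ignored)."""
    return hqla / stressed_outflow(deposits)

# Two hypothetical banks, each with 100 (billion) in total deposits.
retail_bank = {"stable_insured_retail": 80, "uninsured_financial_wholesale": 20}
wholesale_bank = {"stable_insured_retail": 20, "uninsured_financial_wholesale": 80}

for name, book in (("retail-heavy", retail_bank), ("wholesale-heavy", wholesale_bank)):
    need = stressed_outflow(book)
    print(f"{name}: total={sum(book.values())}, buffer for a 100% ratio = {need:.1f}")
```

Same total, very different buffer: under these assumed weights the wholesale-heavy book needs several times the liquid assets of the retail-heavy one to reach the same ratio.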
SVB is the smoking gun for that claim. Its deposit base was, in LCR terms, a 100%-run-off liability wearing the costume of an ordinary deposit base. A bank holding SVB's exact dollar total but funded by sticky, insured, retail checking accounts would have walked through the same week untouched. The run was lethal not because it was big but because the flighty segment is also the fast segment, and SVB was almost all flighty segment.
Now look at how we plan for load.
Almost every capacity conversation collapses traffic into a single scalar. "We can handle 50,000 requests per second." "We're provisioned for 100,000 concurrent users." "The load test held at peak." And behind that one number sits one implicit behavioral model: one retry policy, one timeout, one assumption about how users behave when the system gets slow.
That is the error banking banned. It is modeling the total without modeling the mix. And the total cannot tell you the one thing you actually need to know: whether the next surge is the sticky kind or the flighty kind, because, exactly as with SVB, those two have opposite survival outcomes.
Borrow a phrase from the banking world: deposits have a beta, a sensitivity that varies by type. (In banking, "deposit beta" technically means how fast deposit rates track central-bank rates; I'm stretching it deliberately, as a metaphor for stickiness.) Your traffic has a beta structure of its own, and once you look for it, the funding types are obvious:
There's a quiet irony here that the discipline should find embarrassing. Every product team on earth already segments users obsessively (cohort retention, stickiness, churn prediction); tools like Amplitude and Adjust exist because a 30-day cohort behaves nothing like a six-month cohort. We segment users to the decimal place for growth metrics. Then we turn around and model load as one undifferentiated pool for survival. The analytical muscle is fully developed. It's just pointed at the wrong question.
Why is the flighty segment disproportionately dangerous, rather than just somewhat worse? Because it's reflexive, and reflexivity is the mechanism that turns a wobble into a collapse, in finance and in distributed systems alike.
A bank run is reflexive: the act of withdrawing makes the bank weaker, which gives everyone else a better reason to withdraw, which makes it weaker still. Nobody is being irrational. Each depositor is making the locally correct decision; the correctness is what makes it a cascade.
Distributed systems have a structurally identical demon, and it even has a name: the thundering herd, or retry storm. A service degrades. Thousands of clients time out at roughly the same moment and retry at roughly the same moment. Those synchronized retries land on an already-overloaded service and push it from "slow" into "down," which triggers another synchronized wave of retries. The retrying causes the outage that justifies more retrying. It is a bank run made of HTTP requests.
And here's the detail that should make the analogy feel less like a metaphor and more like the same physics: the standard fixes for both are the same idea wearing different clothes. To stop a bank run you break the correlation and interrupt the reflexivity: deposit insurance removes the incentive to be first out the door, and a bank holiday or a circuit breaker on the exchange physically stops the cascade. To stop a retry storm you do the same two things. Jitter (adding randomized delay to exponential backoff) desynchronizes the clients so they stop arriving in a correlated wave; that's the engineering equivalent of staggering everyone's withdrawals across the month. A circuit breaker trips open and stops sending doomed requests at all; that's the bank holiday. Client-side throttling caps the stampede at the source.
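Both fixes are small enough to sketch. What follows is one common shape for each, not a reference implementation: a "full jitter" backoff that draws the delay uniformly between zero and the capped exponential, and a bare-bones breaker that opens after a run of consecutive failures. The function and class names, the defaults, and the injectable `clock` parameter are all assumptions made for illustration.

```python
import random
import time

def backoff_with_full_jitter(attempt, base=0.1, cap=10.0, rng=random):
    """Delay before retry number `attempt` (0-based):
    uniform in [0, min(cap, base * 2**attempt)]."""
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))

class CircuitBreaker:
    """Opens after `threshold` consecutive failures; lets traffic
    through again once `cooldown` seconds have passed."""

    def __init__(self, threshold=5, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def allow(self):
        if self.opened_at is None:
            return True
        # Half-open: permit a trial request after the cooldown elapses.
        return self.clock() - self.opened_at >= self.cooldown

    def record(self, ok):
        if ok:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()
```

A caller would check `allow()` before each request, sleep for `backoff_with_full_jitter(attempt)` between retries, and report each outcome through `record()`. The randomness is the point: two clients that failed at the same instant no longer come back at the same instant.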
Sticky traffic doesn't stampede. Flighty traffic does. The stampede, not the raw volume, is the kill mechanism. Which means the question "how much total load can we take?" is asking about the wrong variable entirely.
If this sounds like a lot of new machinery to bolt onto your stack, here's the reassuring part: the most mature reliability practice on the planet is already doing the LCR's job. It just never borrowed the banking vocabulary.
Google's Site Reliability Engineering book describes a system where every remote procedure call is tagged with a criticality level: CRITICAL_PLUS, CRITICAL, SHEDDABLE_PLUS, SHEDDABLE. And critically (the pun is unavoidable), that criticality propagates downstream: a request inherits the criticality of whatever spawned it, so fragility flows through the entire call graph automatically. Read that next to the LCR and the resemblance is uncanny. The LCR assigns a run-off weight per funding type and lets fragility propagate through the funding base; Google assigns a criticality per request type and lets it propagate through the request graph. They are the same structural idea, invented independently on two continents by people who never compared notes.
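The propagation idea fits in a few lines. This is a toy sketch, not Google's API: the four level names come from the list above, while the `Request` class, its `spawn` method, the numeric ordering, and the example call chain are hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional

class Criticality(IntEnum):
    # Names follow the four levels listed above; the numeric order is an assumption.
    SHEDDABLE = 0
    SHEDDABLE_PLUS = 1
    CRITICAL = 2
    CRITICAL_PLUS = 3

@dataclass
class Request:
    name: str
    criticality: Criticality
    parent: Optional["Request"] = None

    def spawn(self, name):
        """A downstream call inherits the criticality of whatever spawned it."""
        return Request(name=name, criticality=self.criticality, parent=self)

# Hypothetical call chain: the label set at the edge travels with every hop.
search = Request("search", Criticality.CRITICAL)
index_lookup = search.spawn("index-lookup")
spell_check = index_lookup.spawn("spell-check")
print(spell_check.name, spell_check.criticality.name)
```

Because the label is set once at the edge and copied on every hop, a backend three calls deep can make a shedding decision without knowing anything about the product that originated the request.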
The rest of the SRE toolkit fills in the analogy. Load shedding under overload means selectively rejecting the less-critical requests to protect the core ones: a web service, in Google's own framing, declining non-essential API calls so the essential ones survive. Graceful degradation means serving a cheaper, less-complete answer rather than failing outright. And the canonical example in the SRE book is a deposit-type distinction in disguise: search-as-you-type suggestions are described as "highly sheddable" (nice to have, drop them first under stress) while the actual search query behind them is critical. Two funding types of request, with opposite run-off treatment, living inside the same product.
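A shedding policy built on those labels can be as simple as a threshold per level. The sketch below is an assumption-laden illustration: the utilization cutoffs are invented, and a real system would tune them and likely use a richer overload signal than a single utilization number.

```python
# Hypothetical cutoffs: the utilization (0.0 to 1.0) above which each
# criticality level is rejected. The numbers are made up for illustration.
SHED_ABOVE = {
    "SHEDDABLE": 0.70,
    "SHEDDABLE_PLUS": 0.80,
    "CRITICAL": 0.95,
    "CRITICAL_PLUS": 1.00,
}

def admit(criticality, utilization):
    """True if a request of this criticality should be served right now."""
    return utilization <= SHED_ABOVE[criticality]

def handle(kind, criticality, utilization):
    """Serve or shed one request based on its label and current utilization."""
    if admit(criticality, utilization):
        return f"serve {kind}"
    return f"shed {kind}"

# At 85% utilization the suggestions are dropped and the query survives.
print(handle("suggestions", "SHEDDABLE", 0.85))
print(handle("search", "CRITICAL", 0.85))
```

The ordering of the cutoffs is what matters: the least critical traffic is refused first, so the capacity it would have consumed stays available for the requests that cannot fail.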
So the move this essay is recommending is not "go invent traffic segmentation." You may already shed by criticality. The move is to go one step further and stress-test your composition the way a bank stress-tests its funding mix, because your real headroom is a function of the mix, not the total.
Here is the practical translation, and it's small enough to start this week.
Stop reporting capacity as a scalar. "We tested 100k QPS and it held" is the systems equivalent of "total deposits look fine." It is the number that passes your load test and then kills you in production, because a homogeneous load test specifically cannot surface a flighty-composition failure: every synthetic request behaves identically and politely, which is the one traffic mix you will never actually receive. SVB's balance sheet "passed" on totals, too.
Instead, compute a Liquidity Coverage Ratio for your service. Segment your traffic into its funding types (sticky paying users, flighty anonymous spike, demand-path synchronous, time-path batch) and assign each one a stress weight built from three factors the bankers would recognize: its retry multiplier (how much extra load it generates when it's failing), its correlation (how synchronized its surge is), and its abandonment rate (how fast it gives up, which paradoxically adds load through immediate retries). Then size your headroom not to the peak scalar, but to the stressed, weighted composition: enough capacity to survive a surge made disproportionately of your 100%-run-off segment.
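One way to turn that into a number, offered as a sketch rather than a standard: treat the three factors as multipliers on each segment's baseline and divide capacity by the weighted sum. The multiplication rule, the field names, and every figure in the example are assumptions. The essay names the factors but not a formula for combining them.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    name: str
    baseline_qps: float
    retry_multiplier: float  # requests generated per original request while failing (>= 1)
    correlation: float       # 0 = surge is spread out, 1 = everyone arrives together
    abandonment: float       # fraction that gives up and immediately re-issues

    def stress_weight(self):
        # Assumed combination rule for this sketch: the three factors multiply.
        return self.retry_multiplier * (1 + self.correlation) * (1 + self.abandonment)

    def stressed_qps(self):
        return self.baseline_qps * self.stress_weight()

def coverage_ratio(capacity_qps, segments):
    """Capacity divided by stressed, weighted demand; below 1.0 fails the test."""
    return capacity_qps / sum(s.stressed_qps() for s in segments)

# Hypothetical service: 50k baseline QPS against 100k QPS of capacity.
segments = [
    Segment("sticky_paying", 30_000, 1.2, 0.1, 0.05),
    Segment("flighty_anonymous", 10_000, 3.0, 0.9, 0.5),
    Segment("time_path_batch", 10_000, 1.0, 0.0, 0.0),
]
ratio = coverage_ratio(100_000, segments)
print(f"coverage ratio: {ratio:.2f}")
```

On the scalar view this hypothetical service has twice the capacity it needs. On the weighted view the ratio comes out below 1.0, and almost all of the shortfall traces to the one flighty segment that makes up a fifth of baseline traffic.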
And then stress-test the mix, not just the magnitude. Run the load test where the surge is 80% anonymous, aggressively retrying, correlated bot traffic (your uninsured-wholesale scenario) and watch what your retry amplification does to a system that looked comfortable at the same total QPS of polite logged-in users. That gap, between the friendly-mix number and the hostile-mix number, is your real exposure. It is the gap SVB never measured.
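Before running the real test, a back-of-envelope model shows where the gap comes from. This sketch assumes each attempt fails independently at a fixed rate and each client stops after a fixed number of attempts. It ignores correlation and the feedback loop in which retries raise the failure rate, so treat it as a floor on the effect rather than a forecast. The segment names and attempt limits are hypothetical.

```python
def expected_attempts(failure_rate, max_attempts):
    """Expected attempts per request when each attempt fails independently
    with probability `failure_rate` and the client stops after `max_attempts`."""
    return sum(failure_rate ** k for k in range(max_attempts))

# Hypothetical client behavior: polite users retry once, bots retry nine times.
MAX_ATTEMPTS = {"logged_in": 2, "anonymous_bot": 10}

def offered_load(total_qps, mix, failure_rate):
    """Requests per second actually hitting the service once retries are counted."""
    return sum(
        total_qps * share * expected_attempts(failure_rate, MAX_ATTEMPTS[segment])
        for segment, share in mix.items()
    )

friendly = {"logged_in": 0.8, "anonymous_bot": 0.2}
hostile = {"logged_in": 0.2, "anonymous_bot": 0.8}

for failure_rate in (0.0, 0.2, 0.5):
    f = offered_load(100_000, friendly, failure_rate)
    h = offered_load(100_000, hostile, failure_rate)
    print(f"failure={failure_rate:.0%} friendly={f:,.0f} hostile={h:,.0f} gap={h - f:,.0f}")
```

With no failures the two mixes are indistinguishable, which is exactly why a healthy-path load test cannot tell them apart. The gap only opens once something starts failing, and it widens as the failure rate climbs.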
Capacity planning has spent its whole life asking "how much load can we take?", a total question, the same question that made SVB's balance sheet look healthy until Thursday afternoon. Banking learned, the hard and expensive way, to ask a better one: "what is our load made of, and what happens to each part when everyone gets scared at once?" The total number is comfortable because it hides which surge is coming. Your composition is the thing that's actually going to decide whether you're standing on Friday morning.
You can only weight a segment you can actually attribute.
Computing a Liquidity Coverage Ratio for your service assumes you can tell which request belonged to which funding type, and what it actually did under stress: did it retry, how many times, in step with whom. When the traffic is autonomous agents, that attribution is exactly what a fleet-wide success rate hides, and a per-agent self-report cannot be trusted to reconstruct it. Chain of Consciousness anchors every agent action to a verifiable external record, so you can segment load by what each agent really did, not by what the dashboard chose to remember.