← Back to blog

The Cytokine Storm

When your defenses' own feedback loop kills the patient. A retry storm is a cytokine storm with a stack trace.

Published June 2026 · 11 min read

On the morning of March 13, 2006, eight healthy young men checked into a clinical trials unit at Northwick Park Hospital in London to earn a few hundred pounds testing a new drug. Six of them received it. The drug was TGN1412, an experimental antibody designed to wake up the immune system's T cells, and the dose was a fraction of what had been given safely to monkeys. Within roughly ninety minutes the first man began to shiver. Then came the fever, the crashing blood pressure, the vomiting, the swelling so severe the press would later call it the “Elephant Man” trial. Within hours all six were in intensive care with multiple organs failing. They survived, barely, but one of them lost parts of his fingers and toes to the damage.

Here is the detail that should unsettle anyone who builds systems for a living: there was no infection. No virus, no bacteria, nothing to fight. The drug didn't poison them. It flipped a switch that told their immune systems to respond, and the response, with nothing braking it, nearly killed six men who had walked in healthy. That is a cytokine storm in its purest possible form, stripped of any pathogen to blame, and it is the clearest demonstration of a rule that governs the worst failures in biology and in software alike: when things go truly wrong, the thing killing you is usually your own response.

The loop that turns a control system into a weapon

To see why this matters for your stack, you have to see what a cytokine storm actually is, mechanically. A healthy immune response is a negative-feedback control system, the kind engineers admire. Cytokines, the small signaling proteins like the interferons, the interleukins, and TNF-α, coordinate the body's cells toward a threat, ramp the response up to meet it, then damp it back down as the threat clears. Sense, respond, stabilize. Homeostasis.

A cytokine storm is that same control loop inverting into positive feedback. Macrophages release cytokines, which activate T cells, which release more cytokines, which activate more cells, which release still more: an exponential surge, made worse by a form of inflammatory cell death called PANoptosis that bursts cells open and spills yet more signal into the fire. The immunology literature names it without euphemism: a cytokine storm is “a failure of the immune system's regulatory mechanisms.” The control loop didn't break by going silent. It broke by becoming an amplifier.

And the consequence, documented across the worst cases of COVID-19 and sepsis, is the punchline this whole essay turns on: the immune response to the pathogen, not the pathogen itself, is what drives the multi-organ failure that kills. The landmark 2020 New England Journal of Medicine review by Fajgenbaum and June put it plainly: the response can cause the lethal dysfunction, and storms can even occur with no obvious infection at all. The virus, when there is one, lights the match. The fire is your own.

Every automated defense is a cytokine

Now look at a distributed system, because the structure is the same down to the feedback sign. A healthy system, like a healthy body, runs on negative feedback: controls that sense trouble and push back toward stable. Retries push back against a dropped packet. Failover pushes back against a dead replica. Autoscaling pushes back against rising load. Alerting pushes a human toward a problem. Each is a defense. Each is usually correct. And each can invert, exactly the way inflammation inverts, into the positive-feedback loop that becomes the outage.

The vivid one is the retry storm. When a service starts to degrade, every client whose request fails begins a retry timer. The retry queue grows silently, across every client, every layer, every region, while the service is down and no one is watching that particular number. Then the service recovers. The first health check passes, the load balancer reopens the pool, and in that first second of restored capacity the entire backlog of queued retries from every client on earth arrives at once and flattens it again. The recovery is the murder weapon. The system didn't fail to come back; it came back and was immediately re-killed by its own clients trying to be helpful.

Run the same logic through the other three defenses and you get the same shape. Failover sends a thundering herd of traffic stampeding onto the replicas you failed over to, toppling them in turn. Autoscaling spins up a wave of new instances that each open connections to the shared database, exhausting the connection pool and deepening the very overload that triggered the scale-up. And the alert storm buries the one signal the on-call engineer needs under ten thousand pages, blinding the responder at the precise moment sight matters most. In every case the outage stops being the original fault and becomes the response to it. A retry storm is a cytokine storm with a stack trace.

Metastability: the part that makes this a theorem, not a metaphor

The cytokine-storm-as-retry-storm comparison is, honestly, not new; it has been blogged a hundred times. What lifts it from a clever pairing to something rigorous is a specific result, and it captures the cruelest, least intuitive property these failures share.

In 2021, four distributed-systems researchers (Nathan Bronson, Abutalib Aghayev, Aleksey Charapko, and Timothy Zhu) published a paper at the HotOS workshop naming the shape: the metastable failure. Its defining feature is this: a sustaining effect keeps the system trapped in the bad state even after the original trigger is removed. You find the bug that started the outage and fix it. You restore the dependency that died. And the system stays down anyway, because the loop no longer needs the trigger. The retry backlog is now feeding itself; the overload sustains the overload. The authors trace a long list of the worst outages at AWS, Azure, and Google Cloud to exactly this pattern, and the reason these incidents are so notorious is precisely that the obvious fix, remove the cause, does nothing.

The biology has the identical, eerie property, and it is the single best line of evidence that this is one phenomenon and not two. A cytokine storm can persist, or even worsen, after the pathogen has been cleared. The virus is gone; the lab confirms it; the storm rages on, because the loop has decoupled from its cause and is now driving itself. Metastability is the cytokine storm's formal twin. And it carries the most important operational lesson here, the one that contradicts every instinct: in a metastable failure, fixing the original fault does not bring you back. You cannot wait out the trigger, because the trigger is no longer the problem. You have to treat the response.

The cure is to brake your own defense

So consider what physicians actually do when a patient is in a cytokine storm. They do not reach first for antivirals. The standard of care that emerged for severe COVID-19 was tocilizumab, a drug that blocks the IL-6 receptor, deliberately suppressing an immune signal rather than attacking the virus, alongside dexamethasone, a blunt, cheap immunosuppressant that the large RECOVERY trial showed reduced deaths in the severe, oxygen-dependent patients. The clinician's move in a storm is to put the brakes on the immune system itself.

The operational move is identical, and just as counterintuitive to the engineer mid-incident. In a retry storm, you do not scale the backend to fight the “pathogen,” that is throwing more capacity at a loop that will consume whatever you give it. You disable retries. You shed load. You drain the backlog. You suppress your own defense, because fighting harder feeds the loop and braking starves it. Every reflex says add capacity, restart, push through. The reflex is wrong, in the same way and for the same reason that pouring antivirals on a post-clearance cytokine storm is wrong: the cause is already gone, and the response is the disease now.

But the brake can kill too, and this is the part nobody finishes

Here is where most versions of this advice stop, at “add a circuit breaker,” and here is where the honest version begins, because the brake is itself a tunable risk, and mis-set in either direction it is lethal.

The clinical evidence on this is sharp and worth sitting with. Dexamethasone helped the severely ill, and was useless, or actively harmful, in milder cases. And across the great majority of trials of anti-inflammatory drugs in human sepsis and COVID, suppressing the immune response has not improved survival. The reason, as the immunologists phrase it, is that an immune response has to be matched to the challenge. Brake too hard and you leave the patient defenseless against the actual infection; the suppression becomes its own cause of death, killing through the secondary infections that overrun an immune system you switched off. The brake is not a free intervention. It is a second control loop, and it needs to be as well-calibrated as the first.

The engineering mirror is exact, and it is the reason “just add circuit breakers” quietly fails. A circuit breaker tuned to trip too eagerly denies legitimate traffic and turns a partial, recoverable degradation into a full, self-inflicted outage. A retry budget set too low fails requests that would have succeeded on the second attempt. A load-shedder dropping the wrong ten percent sheds your highest-value customers. A breaker jammed permanently open is not safety; it is just a faster outage with a clearer conscience. The governor you add to damp the storm becomes, if you mis-tune it, a new way to fail, and the failure it causes is the kind that looks deliberate, which makes it harder to spot. This is the recurring, under-told lesson of every regulatory system the body runs: the damping layer needs calibration, not merely existence.

Build the regulatory-T-cell layer, then tune it

The body did not leave any of this to heroics or to noticing in the moment. It evolved a dedicated subsystem whose entire job is to damp its own defense: regulatory T cells, and the anti-inflammatory cytokines IL-10 and TGF-β, a standing brake held against the immune response at all times. That regulatory machinery is not a minor footnote of immunology; the discovery of regulatory T cells and the peripheral tolerance they enforce won the 2025 Nobel Prize in Medicine. The brake is first-class anatomy, built in, not bolted on after the first storm.

Your defenses need the same, designed as a deliberate layer rather than improvised during the incident:

Each of these is a regulatory T cell for a specific cytokine. And each, like a regulatory T cell, has to be tuned to the challenge: strong enough to stop the runaway, gentle enough not to disable the defense.

The practical version fits in one question, and you can run it against your architecture this afternoon. For every automated reflex you operate, every retry, every failover, every autoscaling rule, every alert, ask: what stops this one when it's firing wrongly? If the honest answer is “a human eventually notices and turns it off,” then you don't have a governor on that reflex. You have a volunteer at Northwick Park, healthy at 9 a.m. and hoping the response winds itself down before it does too much damage. Build the brake, put it on your own reflexes, and then do the part that is actually work: tune it, in both directions, against real load, because an unchecked defense is just a slower way to fail, and an over-checked one is a faster way. The body spent a few hundred million years learning to police its own immune system into something that protects more than it destroys. You have until the next traffic spike.


Sources: The TGN1412 first-in-human trial, Northwick Park Hospital, London, March 13, 2006: six healthy volunteers, a CD28-superagonist antibody, and a near-fatal cytokine storm with no infection present. Cytokine-storm biology and clinical course: David Fajgenbaum & Carl June, “Cytokine Storm,” New England Journal of Medicine (2020); Nature Reviews Disease Primers (2025) on the self-amplifying positive-feedback loop and PANoptosis; the principle that the immune response, not the pathogen, drives multi-organ failure. Treatment evidence: tocilizumab (anti-IL-6-receptor) as standard of care for severe COVID-19; dexamethasone's mortality benefit confined to the severe, oxygen-dependent cohort (RECOVERY trial), with anti-inflammatory therapy frequently failing in sepsis/COVID because the response must be matched to the challenge. The formal model: Bronson, Aghayev, Charapko & Zhu, “Metastable Failures in Distributed Systems,” HotOS 2021: a sustaining effect traps the system in a bad state after the trigger is removed. Distributed-systems governors: the retry storm / self-inflicted-DDoS mechanism; Michael Nygard, Release It! (circuit breaker); AWS SDK token-bucket retry throttling (2016) and the AWS Builders' Library on timeouts, retries, and backoff with jitter; Google SRE on addressing cascading failures. The body's regulatory layer (regulatory T cells, IL-10, TGF-β) and the 2025 Nobel Prize in Medicine for regulatory T cells and peripheral tolerance. The claim is shared feedback structure and a shared cure (damp the response, calibrated), not that suppressing immunity is simply good; the clinical record is explicitly severity-matched, which is the essay's point.

You can't tune a governor on a reflex you can't see.

The whole essay turns on calibration: a brake mis-set in either direction is lethal, and you tune it “against real load.” That requires a faithful record of what each automated reflex actually did, retried, shed, failed over, when it fired and how hard. For a fleet of autonomous agents, that record can't be the agent's own after-the-fact summary, because the reflex that stormed is the thing writing the summary. Chain of Consciousness anchors every action an agent takes to a tamper-evident record, so when a governor misfires you can see it happen instead of waiting for a human to notice the smoke.

See a verified action chain · Hosted Chain of Consciousness

pip install chain-of-consciousness  ·  npm install chain-of-consciousness