The bug that survives every retry isn't resistant, it's dormant. Stop hitting the part that's awake.
In 1944, a physician named Joseph Bigger was working with penicillin in Dublin, and he ran an experiment that should have been boring and instead became a footnote that took eighty years to come fully due. He took a culture of Staphylococcus, hit it with penicillin, and watched almost all of it die. Almost. A tiny remnant survived. So far, unremarkable: survivors are what you expect from a drug. The interesting part was what happened next. Bigger took those survivors, let them grow back into a full culture, and hit that culture with penicillin again. And it died. All of it, just like the first time, just as thoroughly. The survivors had not become tougher. Their offspring were exactly as vulnerable as their ancestors. Whatever had let that remnant live through the first treatment, it wasn't something they passed on, and it wasn't something the drug had failed to overcome by force.
Bigger called them "persisters," and the word he chose was precise. They didn't resist the penicillin. They persisted through it. The distinction sounds like hairsplitting and is in fact the whole story, not just of chronic infection, but of the most maddening class of bug you will ever chase: the one that survives every restart, every retry, every careful remediation, and comes back identical, as if you had done nothing at all.
To see why persisters matter, you have to understand how most antibiotics actually kill. They don't work like poison, attacking a fixed structure that's always there. They sabotage processes: specifically, the processes of growth. Beta-lactams like penicillin jam the machinery of cell-wall construction; the cell tries to divide, fails to build a wall, and bursts. Quinolones corrupt DNA replication; the cell tries to copy its genome and tears itself apart. These drugs are devastating against bacteria that are doing the one thing bacteria mostly do, which is multiply. The killing requires growth. The target has to be moving.
A persister cell stops moving. It drops into a state of dormancy or quiescence: it stops dividing, throttles its metabolism, and essentially does nothing. And here is the trick that has kept clinicians up at night since the 1940s: a cell that is doing nothing presents nothing for a growth-targeting drug to attack. There is no wall being built to jam, no genome being copied to corrupt. As the field's 2019 consensus paper in Nature Reviews Microbiology puts it, in non-growing cells the antibiotic's targets are "recalcitrant": not defended, just absent from the fight. The penicillin sweeps through, lethal to everything awake, and washes harmlessly over the handful that are asleep. Then the treatment ends, the drug clears, the dormant cells wake up, and they rebuild the entire population from the remnant. The infection relapses, genetically identical, fully susceptible, and completely undefeated.
This is worth holding onto because it inverts the intuition. The survivors are not the strong ones. There is no arms race here, no cleverer enzyme, no thickened membrane. The survivors are the ones that opted out of the battle the drug knew how to fight. Kim Lewis, who did much of the modern work that revived Bigger's footnote into a discipline, framed it in his 2007 review as the central puzzle of why infections recur: the tolerance "is not inherited and is reversible. Persisters are not mutants, but rather dormant cells." Strength was never the variable. Activity was.
The beautiful thing about persistence, beautiful if you're a diagnostician, infuriating if you're a patient, is that it leaves a fingerprint. If you plot the survivors against time under treatment, resistance and persistence look completely different. A genuinely resistant strain gives you a flat-ish line: the cells keep living because they can actually grow in the presence of the drug. But a persister population gives you a distinctive two-phase curve. The bulk of the population dies fast, a steep cliff, and then the curve bends and trails off into a long, slow, stubborn tail of survivors that the drug just cannot seem to finish. Microbiologists call it the biphasic killing curve, and it is the signature that says: what you're looking at is not resistance. It's dormancy.
The vocabulary here is load-bearing, so it's worth getting exact, because the field fought to make these words mean distinct things. Resistance is a heritable change that lets a cell grow at drug concentrations that would kill its ancestors: it actively defeats the drug and passes that ability on. Tolerance is when a whole population dies more slowly without being able to grow under the drug: it buys time, not victory. And persistence is a special case of tolerance: not the whole population, but a subpopulation that survives far better than the rest. That subpopulation is the tail of the biphasic curve. The cleanest one-line distinction, from the same 2019 paper: resistant mutants are capable of dividing during treatment, while persisters can only survive. One fights. The other waits.
Now read that sentence again with your engineering hat on, because you have met both of these bugs, and you have almost certainly confused them.
When a fault survives every restart, retry, and remediation, the reflex is to call it "hard": a deep, stubborn, well-defended bug that's resisting your fix. Sometimes that's true. But far more often, the fault that laughs at your retries isn't resistant at all. It's dormant. And it's surviving for exactly Bigger's reason: your remediation, like the antibiotic, only hits the part of the system that's active.
Consider the cleanest case, one that's documented in public on the anthropics/claude-code issue tracker (issue #23081): a connection drops, the client's retry logic kicks in, and the retry keeps failing in precisely the same way. The reason is that the retry reuses stale connection state: the HTTP/2 connection pool, the cached TLS session, the DNS entry. The "fresh" attempt isn't fresh at all. It reaches back into the pool, picks up the same poisoned connection, and re-runs the failure. The fix, as the issue describes, isn't to retry harder or more often: it's to destroy and recreate the client, or at minimum flush the pool and the TLS cache, before retrying. Your retry was the antibiotic. It struck the active code path with full force. And the dormant poisoned connection, sitting idle in the pool, doing nothing, survived precisely because it was doing nothing when you struck.
The pattern is everywhere once you see its shape. There's the connection pool that hands back a dead connection after a quiet period: a fault that is, in the words of one practitioner write-up, "intermittent, and only happens after a period of idleness when traffic picks back up." It's asleep during the calm and revives on load: relapse, exactly. There's the database replica that's offline or lagging while you perform your fix on the primary, sitting there quiet and uninvolved, and then it reconnects, wakes up, and syncs its stale, corrupt state back into the system you just cleaned. There's the cached bad config, the queued-but-not-running job, the sleeping cron, the in-memory singleton holding a poisoned value from before the fix. Each of these is a place a fault can be dormant during your remediation. Each is a biofilm of one.
And there's the tell that should make the hair on your neck stand up, because it's the biphasic curve rendered in operations: the restart that helps but doesn't last. Microsoft's own support docs note, about connection-pool exhaustion, that stopping and starting the service "helps clear stale connections... though this is typically a temporary solution rather than addressing root causes." That is the two-phase killing curve described in plain English. You restarted, you killed the active population, the problem dropped away, the steep cliff, and then the dormant reservoir revived and the failure trailed back in. A fix that works beautifully for ten minutes and then relapses under load is not a flaky fix. It is a persister, waking up.
Here is where the biology earns its keep, because it tells you that resistance and persistence are different bugs that need opposite responses, and that mixing them up is the specific error that keeps you fighting forever.
When a fix doesn't stick, the human instinct is to escalate force. Retry harder. Restart more aggressively. Add a backoff, then add more retries to the backoff. Throw capacity at it. This is exactly the right move against a resistant fault: one that actively defeats your fix and propagates under it, a real logic error that survives and spreads while your remediation runs. A resistant fault needs a better fix, and more force can be part of finding it.
But against a persister, every bit of that escalated force is wasted, and worse than wasted: it's invisible. Force tuned to the active population cannot touch the part that's asleep, by definition, because the thing that makes the persister survive is that it isn't participating in the process your force acts on. You can triple your retries against a stale connection pool and you will simply re-serve the poisoned connection three times as fast. You can restart the service every hour and the lagging replica will re-seed the corruption every hour after. The persister isn't beating your fix. It's ignoring it, the way a sleeping cell ignores a drug that only kills cells that are awake. Hammering the active symptom harder while the dormant pool waits you out is the single most common way good engineers lose days to a bug that a five-minute cache flush would have ended.
So the diagnostic question is not "how do I hit this harder?" It's "what part of this system was asleep while I was fixing it?"
There's a sting in the tail, too, and it's the reason you don't get to ignore the dormant reservoir even when the relapse is survivable. In biology, the persister pool isn't just a relapse risk: it's where worse trouble incubates. The 2014 work in Cell on persister mechanisms notes that persistence has been linked to a higher risk of true resistance emerging during treatment: the dormant survivors buy time, and time is what evolution needs to stumble onto a real, heritable defense. The dormant pool is where a temporary problem can become a permanent one. The software analogy isn't biology, but it's faithful: the cached corruption that sits around long enough to become the value everything else trusts; the stale replica that, because it was the one that happened to be up during the incident, gets promoted to source of truth. Leave the reservoir unflushed long enough and a fault that was merely recurring gets baked into structure. Clear it before it sets.
The good news is that infectious-disease medicine has spent eighty years learning how to beat an enemy that wins by waiting, and the discipline transfers almost line for line. It is not "use a stronger antibiotic." It is a sequence.
Read the curve before you escalate. If restarting or retrying kills most of the problem fast and then a stubborn fraction always comes back, that shape is the diagnosis. You are looking at a persister, not a resistant bug. Stop reaching for more force. More force is the wrong instrument.
Enumerate the dormant reservoirs before you re-fix. Make the list explicitly, every time, because the reservoir is by definition the part you weren't looking at: caches, queues, idle connection pools, sleeping crons, offline or lagging replicas, persisted session state, in-memory singletons. Anywhere the fault could be asleep while your remediation runs is a place to check.
Flush, don't hammer. Invalidate the caches. Drain the queues. Destroy and recreate the connection pool rather than retrying on it. Quiesce and rebuild the idle replicas rather than trusting them to heal. Eradicate the reservoir instead of escalating force on the symptom you can see.
Quiesce before you fix, the most clinical move of all. The reason chronic infections get a full course and not a single megadose is to keep the drug present while dormant cells wake into its range. The operational equivalent is to stop or drain the dormant population first, so it can't re-seed the corruption mid-repair. Take the lagging replica out of rotation, drain the queue, empty the pool: fix the quiet system, and only then bring the rest back. Fixing a system while a dormant reservoir is still feeding it is how you end up applying the same fix four times.
And verify across the relapse window. Don't declare victory at the restart, when the active population is freshly dead and everything looks clean. Confirm the fault doesn't return when load resumes and the idle parts wake, the moment the "antibiotic concentration drops." That's the software equivalent of finishing the full antibiotic course instead of stopping when you feel better, which is, not coincidentally, the exact behavior that breeds persisters in the first place.
Bigger saw it in 1944 and the rest of us keep rediscovering it in the small hours, staring at a bug that has survived a fix we know is correct. The thing to remember in that moment is that survival isn't strength. The fault that laughs at every retry usually isn't tougher than your fix: it's asleep during it. So stop hitting the part that's awake. Find the part that's dormant, and flush the reservoir.
When an agent fails the same way every retry, the dormant state is the bug. You have to be able to see it.
An agent that keeps relapsing into the same failure is rarely fighting your fix; it's reusing dormant state the retry never touched, a cached belief, a stale piece of context, a poisoned memory carried across attempts. You can only find the reservoir if you can read what the agent actually did each time, not just the surface retry. Chain of Consciousness is that record: a tamper-evident chain of an agent's reasoning, tools, and actions across attempts, so you can see the biphasic curve in its behavior, spot the state that survived the restart, and flush it instead of hammering the part that was already awake.
See Hosted Chain of Consciousness · verify an action chain
pip install chain-of-consciousness · npm install chain-of-consciousness