The Therapeutic Window

Every config has a dose too low to work and a dose that harms. The narrow ones need monitoring, and your retry count is lithium.

Published June 2026 · 11 min read

Lithium keeps millions of people alive. It is one of the oldest and most effective treatments for bipolar disorder, and it is also, at not much more than the helpful dose, a poison that can damage your kidneys, scramble your nervous system, and kill you. The therapeutic serum range is usually given as roughly 0.6 to 1.2 milliequivalents per liter. Toxicity starts creeping in above about 1.5, and above 2.0 you are in real danger. The dose that heals and the dose that harms nearly touch.

So we do not just prescribe lithium and walk away. Patients on lithium get their blood drawn on a schedule, for years, because the only safe way to use a drug whose helpful and harmful doses are a sliver apart is to keep measuring the level. Warfarin, the blood thinner, is the same story, and so is digoxin for the heart. Medicine has an entire specialty for this, therapeutic drug monitoring, and it exists for one reason: some drugs have a window so narrow that you cannot trust a one-time dose to stay safe.

Here is the thing every operations team eventually learns the hard way. Your retry count is lithium. So is your rate limit, your connection-pool size, and your autoscaler threshold. Each one has a dose too low to work and a dose that harms, the gap between them is sometimes a sliver, and most teams tune them as if more were simply better and then never check the level again. Pharmacology named this problem nearly five hundred years ago and has built a discipline around it since. That discipline transfers almost intact, and almost nobody in software has imported it.

The question pharmacology asks that you don't

The founding axiom of toxicology is older than the United States. In 1538 the physician Paracelsus stated the principle that later writers condensed into the Latin maxim Sola dosis facit venenum: the dose alone makes the poison. All things are toxic, he argued, and nothing is without toxicity; whether a substance harms you is a property of how much, not what. This is the sentence under everything that follows, and it has a blunt corollary that ops engineers resist instinctively: even the good things have a toxic dose. Water causes fatal hyponatremia if you drink enough fast enough. Pure oxygen poisons. There is no substance whose dose-response curve only goes up.

Because of that, pharmacology does not ask the question you ask about a config. You ask “is this a good value?” Pharmacology asks something sharper: how wide is the gap between the dose that works and the dose that harms? That ratio has a name, the therapeutic index, defined roughly as the toxic dose for half the population divided by the effective dose for half the population. A high ratio means a forgiving drug, like penicillin, that you can dose generously and not worry about. A low ratio means a narrow-therapeutic-index drug, like lithium, where a small overshoot is dangerous. The whole window sits between the minimum concentration that does anything and the minimum concentration that hurts. Below it, no benefit. Above it, harm. Only the band between is medicine.
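Written as a formula, using the conventional symbols TD50 and ED50 for the median toxic and median effective doses:

```latex
\mathrm{TI} = \frac{\mathrm{TD}_{50}}{\mathrm{ED}_{50}}
```

A larger TI means a wider margin between the dose that helps and the dose that harms.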

And critically, the benefit curve saturates. Past the effective level, more drug buys you almost nothing extra in effect, while toxicity keeps climbing. The relationship between dose and good outcomes is not a line that keeps rising; it is a curve that plateaus while the harm curve does not. “Bigger dose, better result” is false past the peak, and it is false in exactly the place where your intuition is most confident that more must be safer.
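A toy version of those two curves makes the shape concrete. This is a minimal sketch, not a pharmacological model: the benefit function follows the standard saturating (Emax-style) form, while the harm function, its threshold, its slope, and every number here are assumptions chosen only to illustrate the plateau-versus-climb relationship.

```python
def benefit(dose: float, e_max: float = 1.0, ed50: float = 1.0) -> float:
    """Saturating (Emax-style) benefit: rises quickly, then plateaus near e_max."""
    return e_max * dose / (ed50 + dose)


def harm(dose: float, threshold: float = 2.0, slope: float = 0.3) -> float:
    """Illustrative harm curve (an assumption): zero up to a threshold, then climbing."""
    return max(0.0, slope * (dose - threshold))


def net_effect(dose: float) -> float:
    """Benefit minus harm at a given dose."""
    return benefit(dose) - harm(dose)


# Hypothetical dose levels, doubling each time.
DOSES = [0.5, 1, 2, 4, 8, 16]

for dose in DOSES:
    print(f"dose={dose:<4} benefit={benefit(dose):.3f} "
          f"harm={harm(dose):.3f} net={net_effect(dose):.3f}")
```

With these made-up parameters, each doubling of the dose adds less benefit than the one before, while harm keeps growing once the threshold is crossed, so the net effect peaks and then goes negative.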

Every knob is a drug with a window

Now look at your config file with that lens, and the knobs stop looking like settings and start looking like prescriptions, each with an effective floor and a toxic ceiling.

The retry count is the cleanest case. Too few retries and transient blips that you could have masked turn into user-visible errors; you are below the effective dose. Too many, and you get the toxic dose with a name: the retry storm. A dependency slows down, every client retries, the retries multiply the load, the extra load slows the dependency further, and the system locks into a self-sustaining overload that does not clear even after the original trigger is gone. Researchers call these metastable failures, and they are a documented, recurring cause of large outages: the system has two stable states, healthy and collapsed; a trigger tips it over, and an overly aggressive retry policy is the feedback loop that holds it there. You added retries to defend against failure and manufactured a bigger one. That is the dose making the poison, in production.
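The multiplication is easy to put numbers on. The sketch below assumes the simplest possible model: every attempt fails independently with the same probability, and a client retries immediately up to a fixed cap. The function name and the sample failure rates are illustrative, not taken from any real system.

```python
def expected_attempts(failure_rate: float, max_retries: int) -> float:
    """Expected attempts per logical request under independent failures.

    Attempt k+1 only happens if the first k attempts all failed, so the
    expectation is 1 + p + p^2 + ... + p^max_retries.
    """
    return sum(failure_rate ** k for k in range(max_retries + 1))


# Load multiplier on the dependency for a few hypothetical conditions.
for p in (0.01, 0.5, 0.9):
    for retries in (0, 3, 10):
        multiplier = expected_attempts(p, retries)
        print(f"failure_rate={p:<4} max_retries={retries:<2} "
              f"attempts per request={multiplier:.2f}")
```

On a healthy day the extra retries cost almost nothing, which is why a high cap looks safe. When the failure rate climbs, the same cap multiplies load several times over. The independence assumption also understates the problem, because in a real storm the failure rate itself rises with the added load.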

The pattern repeats across every operational knob. A rate limit too loose lets abuse and overload through; too tight, it poisons legitimate traffic with false rejections. A timeout too short fails requests that would have succeeded and wastes work on retries; too long, and slow calls pile up threads and connections until the resource exhaustion cascades. A connection pool too small starves requests into a queue; too large, and you overwhelm the database with more concurrent connections than it can serve. A garbage-collection heap too small thrashes in constant collection; too large, and a single stop-the-world pause stalls everything for seconds. An autoscaler threshold too sensitive flaps up and down and thrashes; too sluggish, and you are under-provisioned right when the load arrives. Every one of them saturates the same way a drug does: past the effective point, more of the knob buys diminishing benefit and rising toxicity.

If you only take one reframe from the pharmacology, take this: the failure is not “we set the wrong value.” The failure is treating the dose-response as monotonic, when it is a window.

The discipline that actually transfers

“Configs have a sweet spot” is old SRE folklore, and if that were the whole essay it would be useless. The pharmacological discipline is more specific and more actionable than that, and it comes in three parts that most engineering teams do not practice.

First: measure the index, not the value. Do not just hunt for a good retry count. Find the gap between the toxic ceiling and the effective floor, because that ratio, not the setting, tells you how much trouble the knob can cause. A cache time-to-live is usually a penicillin-class knob: a wide forgiving band where roughly-right is fine and you can set it and forget it. A retry budget on a hot path can be a lithium-class knob: a narrow band where one extra retry under load is the difference between healthy and a storm. Those are not the same kind of object, and the difference is not the value you chose. It is the width of the safe band relative to how much things drift. That is the property that matters, and the one almost nobody measures.
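One way to make "width of the safe band relative to drift" concrete is to record, for each knob, where you believe the floor and ceiling are and how fast they move. The sketch below is a hypothetical bookkeeping structure, not an established method: the `Knob` class, the `quarters_of_margin` measure, the cutoff of two quarters, and all the sample numbers are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Knob:
    name: str
    floor: float              # lowest value that still works (effective dose)
    ceiling: float            # lowest value that starts to harm (toxic dose)
    current: float            # the value configured today
    drift_per_quarter: float  # estimated movement of the band edges, in knob units

    def headroom(self) -> float:
        """Distance from the current setting to the nearer edge of the safe band."""
        return min(self.current - self.floor, self.ceiling - self.current)

    def quarters_of_margin(self) -> float:
        """How many quarters of drift the current headroom can absorb."""
        if self.drift_per_quarter <= 0:
            return float("inf")
        return self.headroom() / self.drift_per_quarter


def classify(knob: Knob, narrow_below: float = 2.0) -> str:
    """Label a knob 'narrow' when drift could cross its band within a few quarters."""
    return "narrow" if knob.quarters_of_margin() < narrow_below else "wide"


# Hypothetical estimates for two knobs.
cache_ttl = Knob("cache_ttl_seconds", floor=30, ceiling=3600, current=300, drift_per_quarter=20)
max_retries = Knob("max_retries", floor=1, ceiling=4, current=3, drift_per_quarter=1)

for knob in (cache_ttl, max_retries):
    print(f"{knob.name}: margin={knob.quarters_of_margin():.1f} quarters, {classify(knob)}")
```

The estimates will be rough, and that is fine. The point of writing them down is that the two knobs end up in visibly different classes, even though neither current value looks wrong on its own.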

A necessary honesty here: you will not calculate a real toxic-dose-over-effective-dose number for your retry count the way a pharmacologist does for a molecule. Config dose-response curves are messier and rarely have clean medians. Use the therapeutic index as a lens, not a literal formula: ask how wide the safe band is, and how fast the system drifts across it. The frame is what transfers, not false precision.

Second: triage your monitoring by narrowness. This is engineering's missing therapeutic drug monitoring. You cannot continuously instrument every config; medicine does not draw blood to check penicillin levels either. The whole point of triage is that you monitor the narrow drugs intensely and dose the wide ones casually. The engineering twin barely exists. Most teams either alert on nothing config-specific or try to alert on everything and drown in noise. The discipline is to identify your narrow-window knobs, the storm-prone ones, retries and pools and autoscalers, and put continuous, specific monitoring and alerting on those, the way a clinic watches a lithium patient and not a penicillin one. Per-knob alerting, triaged by therapeutic index. That sentence is the actionable core of this whole piece, and it is under-practiced precisely because the framing that makes it obvious has not crossed over from medicine.
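For a retry policy, the closest thing to a blood level is the share of recent traffic that consists of retries. The following is a minimal sketch of such a reading, assuming a simple sliding window over recent attempts. The class name, the window size, and the 10% alert threshold are all hypothetical; a real system would feed this from its own metrics pipeline and pick a threshold by testing.

```python
from collections import deque


class RetryRatioMonitor:
    """Tracks the share of recent attempts that were retries."""

    def __init__(self, window: int = 1000, alert_above: float = 0.1):
        # Each entry is True for a retry and False for a first attempt.
        # deque(maxlen=...) drops the oldest entry once the window is full.
        self.events = deque(maxlen=window)
        self.alert_above = alert_above

    def record(self, is_retry: bool) -> None:
        self.events.append(is_retry)

    def ratio(self) -> float:
        """Fraction of attempts in the window that were retries."""
        if not self.events:
            return 0.0
        return sum(self.events) / len(self.events)

    def should_alert(self) -> bool:
        return self.ratio() > self.alert_above


# Usage with a deliberately tiny window so the effect is visible.
monitor = RetryRatioMonitor(window=10, alert_above=0.2)
for _ in range(10):
    monitor.record(False)      # healthy traffic: first attempts only
print(monitor.should_alert())  # no retries in the window yet

for _ in range(3):
    monitor.record(True)       # a burst of retries displaces older entries
print(monitor.ratio(), monitor.should_alert())
```

The design choice worth noting is that the alert is on the knob's effect (retry share), not on a generic symptom like error rate, which is what makes it a per-knob reading.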

Third: titrate, do not copy. Clinical dosing is empirical; you start low and titrate up toward the window, checking as you go. The engineering equivalents are load tests, canaries, chaos experiments, and progressive rollouts. What you must not do is the thing everyone does: copy a retry count or a pool size from someone else's blog post or postmortem. Those numbers are population medians from a different patient. The effective-and-toxic doses in a study are medians across people; an individual's window differs with their genetics and kidney function. Your service's window for a given knob differs with your traffic shape, your dependencies, and your hardware. Copying a config without testing it on your own system is prescribing a powerful drug without checking the patient.
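In code, titration is just a loop that raises the setting one step at a time and stops at the first sign of trouble. The sketch below assumes you can supply a health check for a candidate value; `is_healthy` is a hypothetical stand-in for whatever canary, load test, or progressive rollout you actually run, and the fake check at the bottom exists only to make the example runnable.

```python
from typing import Callable, Optional


def titrate(start: int, limit: int, step: int,
            is_healthy: Callable[[int], bool]) -> Optional[int]:
    """Raise a setting step by step and return the last value that stayed healthy.

    Returns None if even the starting value fails the check.
    """
    last_good = None
    value = start
    while value <= limit:
        if not is_healthy(value):
            break              # first unhealthy reading: stop increasing
        last_good = value
        value += step
    return last_good


def fake_canary(pool_size: int) -> bool:
    """Placeholder check: pretends this service degrades above 40 connections."""
    return pool_size <= 40


print(titrate(start=10, limit=100, step=10, is_healthy=fake_canary))
```

Starting low matters here: the loop approaches the ceiling from the safe side, so the worst case is one unhealthy canary rather than a production overshoot.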

The kicker: the window moves

Here is the part that is both the most important and the most defensible, the part that does not depend on the metaphor being precise at all. A drug's therapeutic window is not fixed; it moves with the patient. Kidney function changes it, age changes it, and other drugs change it: warfarin's already-narrow window narrows further when you add certain medications, which is why a stable warfarin patient can drift into danger after an unrelated prescription. The window you titrated to last year is not the window today, because the patient changed.

Your configs behave the same way. A retry count that was therapeutic last quarter can be toxic now because the “patient” changed underneath it: traffic doubled, a dependency got slower, a deploy shifted the latency profile, a downstream service degraded. The setting did not move. The window moved out from under the setting. This is the real reason one-time tuning fails and continuous monitoring is required. A narrow-window parameter is, by definition, the one that drifts into toxicity without warning, so it needs a live reading, the blood-level test, not a value you picked once and trusted forever. When people say “but we load-tested that retry policy,” they are describing a single old blood draw on a patient whose chemistry has since changed.

And the windows interact, which makes it worse and more interesting. Configs are not independent drugs taken in isolation; they are polypharmacy. Tighten a timeout and you have just narrowed the retry budget's safe band, because there is now less time for each retry. Enlarge a connection pool and you have shifted the garbage collector's window, because more concurrent work changes the memory pressure. There is no such thing as tuning one knob; you are always adjusting one drug in a regimen, and the interactions move the other windows. The teams that get burned change a single setting in isolation and are surprised when something three components away tips over.
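The timeout and retry interaction can be shown with plain arithmetic. The sketch below assumes one particular reading of "timeout": an overall request deadline, inside which each attempt may take up to a per-attempt timeout and is followed by a fixed backoff. The function name and all the millisecond values are hypothetical.

```python
def attempts_that_fit(deadline_ms: int, attempt_timeout_ms: int, backoff_ms: int) -> int:
    """Number of full attempts that fit inside an overall deadline.

    n attempts need n * attempt_timeout + (n - 1) * backoff milliseconds,
    so n is the largest integer with n <= (deadline + backoff) / (timeout + backoff).
    """
    return (deadline_ms + backoff_ms) // (attempt_timeout_ms + backoff_ms)


# Same retry policy, two different overall deadlines.
print(attempts_that_fit(deadline_ms=2000, attempt_timeout_ms=400, backoff_ms=100))
print(attempts_that_fit(deadline_ms=1000, attempt_timeout_ms=400, backoff_ms=100))
```

Under these assumptions, halving the deadline halves the attempts that can complete. A policy configured for three retries is unchanged in the config file, but its later retries can no longer finish, so the retry knob's effective behavior moved without anyone touching it.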

A little poison is medicine

There is one final twist from the source domain, and it points at the most counterintuitive practice in modern operations. In 1888 the pharmacologist Hugo Schulz noticed that small doses of poisons could stimulate yeast growth rather than suppress it. The phenomenon, hormesis, is a biphasic dose-response: a little of a stressor is beneficial, a lot is toxic. (A caution is in order, because the idea gets abused to argue that low doses of genuinely dangerous things are healthy; the honest version is narrow, that a small, controlled dose of stress can produce a beneficial adaptation.)

That narrow, honest version is exactly what the best operations teams do on purpose. A small, deliberate dose of failure makes a system sturdier, while a large dose breaks it. Netflix's Chaos Monkey, introduced in 2011, randomly kills production instances so that engineers are forced to build services that survive instance death; the small dose of failure is the medicine. Google's Chubby lock service has been deliberately taken down when its real-world uptime exceeded its promised level, specifically to stop other teams from quietly depending on more reliability than it guaranteed; a planned outage as a vaccine against hidden over-dependence. Load-shedding, where a system intentionally drops some requests to stay healthy under pressure, is the same hormetic logic: a small dose of self-inflicted failure to prevent the large one. Resilience, it turns out, is hormetic. The dose makes the poison, and the dose also makes the cure.
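Load-shedding is the easiest of these to sketch. The example below is a minimal, single-threaded illustration that caps the number of in-flight requests and rejects the rest. The `LoadShedder` class and its limit are hypothetical, and a production version would need locking or an atomic counter, plus a smarter signal than a fixed cap.

```python
class LoadShedder:
    """Rejects new work once too many requests are already in flight."""

    def __init__(self, max_in_flight: int):
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        self.shed = 0  # count of requests deliberately dropped

    def try_acquire(self) -> bool:
        """Admit a request if there is capacity; otherwise shed it."""
        if self.in_flight >= self.max_in_flight:
            self.shed += 1
            return False
        self.in_flight += 1
        return True

    def release(self) -> None:
        """Mark one admitted request as finished."""
        if self.in_flight > 0:
            self.in_flight -= 1


shedder = LoadShedder(max_in_flight=2)
decisions = [shedder.try_acquire() for _ in range(3)]
print(decisions, "shed:", shedder.shed)

shedder.release()             # one request completes
print(shedder.try_acquire())  # capacity is available again
```

The third request is the small dose: it fails fast and cheaply so that the two already admitted can finish, instead of all three slowing each other into a timeout.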

What to do Monday

The practical version is a triage protocol, not a tuning guide. Stop asking whether each config value is good and start asking how wide its safe band is relative to how fast your system drifts. Suspect the storm-prone knobs first, retries and pools and autoscaler thresholds and rate limits, because that is where the toxic dose sits closest to the effective one and where overshoot self-amplifies. Find each window by titration, with canaries and load tests and a little deliberate chaos, never by copying a number from someone whose patient was not yours. Then put your continuous monitoring on the narrow-window knobs specifically, the way a clinic watches the lithium patients and sends the penicillin patients home. And keep that monitoring ongoing, because the window moves: the retry policy that was medicine last quarter is the outage waiting this quarter, since traffic and dependencies are a patient whose chemistry never stops changing.

Resilience has a toxic dose too. The next time you reach to add one more retry because surely more can only help, remember the oldest rule in toxicology, the one software keeps relearning in postmortems: the dose makes the poison, and that is as true of your retries as it is of lithium.

Sources:

Paracelsus, Sola dosis facit venenum (“the dose makes the poison”), 1538: the founding principle of toxicology; lineage traced in “The dose response principle from philosophy to modern toxicology” (PMC6226566).

Therapeutic index and therapeutic window: TD50/ED50 ratio, minimum effective vs. minimum toxic concentration, saturating (Emax) dose-response (standard pharmacology references; MedicalNewsToday, “What is the therapeutic index of drugs?”; ScienceDirect Topics, “Therapeutic Window”).

Narrow-therapeutic-index drugs and therapeutic drug monitoring: warfarin, lithium, digoxin, and gentamicin require serum-level monitoring; lithium therapeutic range commonly cited ≈0.6 to 1.2 mEq/L with toxicity above ≈1.5 (approximate clinical consensus; targets vary by indication).

Hugo Schulz (1888) and hormesis: the biphasic dose-response; Kendig, Le & Belcher, “Defining Hormesis” (2010), used only for the “small controlled dose of stress can be beneficial” point.

Bronson et al., “Metastable Failures in Distributed Systems” (HotOS 2021): retry amplification as a sustained-overload failure mode.

Netflix, Chaos Monkey / chaos engineering (introduced 2011): deliberate failure injection as a resilience practice.

Google SRE: the Chubby lock service's deliberate planned outages to enforce its SLO and prevent hidden over-dependence (Site Reliability Engineering, O'Reilly, 2016).

General toxicity-of-the-benign references: water intoxication (hyponatremia) and oxygen toxicity, as illustrations that even essential substances have a toxic dose.

The window moves, so a one-time check is never enough. Trust works the same way.

The whole essay turns on one rule: a narrow-window parameter drifts into danger without warning, so it needs a live reading, not a value you verified once and trusted forever. An agent's trustworthiness is exactly that kind of parameter. Verified once at onboarding, it drifts, the model changes, the behavior shifts, the dependency degrades, and the one-time check goes stale while you keep depending on it. The Agent Trust Stack is the continuous version: portable identity so you know which agent you're dealing with, reputation built from verifiable past outcomes, and a tamper-evident record of what it actually did, checked on every interaction rather than assumed from a single old draw.


pip install agent-trust-stack · npm install agent-trust-stack
