The Old Friends Hypothesis for Agent-Tool Ecosystems

The helminths were never the disease. Their absence was the disease.

Published May 2026 · 11 min read

Doctors are deliberately infecting patients with parasitic worms, and some are getting better.

The treatment is not fringe. Trichuris suis ova (pig whipworm eggs) and Necator americanus larvae (human hookworm) both hold investigational-new-drug status with the FDA and have run through multiple clinical trials for Crohn's disease, ulcerative colitis, multiple sclerosis, celiac disease, and allergic rhinitis. The PROCTO trial, a randomized double-blind placebo-controlled study of whipworm ova for ulcerative colitis, reported in 2024. Safety has been consistently confirmed: therapeutic doses do not reproduce the pathology of heavy natural infections. The field has matured from a fringe idea into a subject of standard clinical-trial methodology.

The reason this works traces to a piece of immunology with a memorable name. The hygiene hypothesis (David Strachan, 1989) observed that children raised in cleaner environments develop more allergies. Graham Rook's 2003 refinement — the old friends hypothesis — identified which organisms actually matter: not childhood infections like measles, but ancient co-evolutionary partners — helminth parasites, certain gut bacteria, environmental mycobacteria — that have shared the human body for millions of years. These organisms do not merely coexist with the immune system. They actively train its regulatory arm — the regulatory T cells, the anti-inflammatory cytokines IL-10 and TGF-β — that suppresses excessive immune responses. Remove the old friends, and the immune system keeps its attack capability while losing its calibration. It still recognizes threats; it can no longer reliably distinguish threats from benign variation. The result is autoimmune disease and allergy. The rise of Crohn's, MS, type 1 diabetes, and asthma across industrialized populations tracks the decline of helminth exposure with a consistency that, while not proof of causation, is hard to dismiss.

This essay is about a structural claim: that the way frontier models are trained recreates the old-friends sequence, that the production failures everyone complains about are the autoimmune dysregulation it predicts, and that the clinical playbook for helminth therapy ports into a design space for training more robust agents.


What the immune system loses when you clean its environment

The key insight in the old-friends hypothesis is that the immune system has two arms, and they have very different developmental requirements. The attack arm — the effector T cells, the inflammatory cascade — develops robustly almost regardless of environment. The regulation arm — the regulatory T cells, the IL-10 and TGF-β anti-inflammatory signaling that decides what not to attack — requires active training by the old friends to calibrate properly. A child raised without helminth exposure develops a powerful attack arm and an undertrained regulation arm. The system is not weak. It is unregulated. It attacks things it should tolerate.

A 2024 University of Pittsburgh study, cited in Cell, sharpened how deep this calibration goes. Researchers injected IL-25 — a cytokine mimicking a helminth-induced immune signal — and observed structural changes to the gut lining lasting more than fifty days in animal models. This matters because the helminth-mediated effect is not merely a transient shift in T-cell populations. It is structural. The old friends do not just change what the immune system tolerates; they change what it physically builds. The calibration is architectural, not behavioral.

Hold that finding. It is the one that makes the agent analogy more than decorative.


The clean-room training problem

Consider the training trajectory of a frontier language model.

Pre-training happens on a massive, diverse, magnificently messy corpus — most of the public internet, full of contradictions, errors, typos, half-finished arguments, malformed code, sarcasm, and noise. This is the model's evolutionary environment. It is the equivalent of the ancestral human environment teeming with helminths and environmental microbes. The model emerges from pre-training having seen, in some statistical sense, nearly every kind of input variation that exists.

Then comes the fine-tuning and RLHF phase. This data is curated, clean, human-labeled, and deliberately sanitized. Safety training adds refusal patterns. The data is selected to demonstrate ideal behavior: well-formed prompts, clean tool responses, unambiguous contexts, correct answers. The model is, in the precise sense the old-friends hypothesis would use, hygienized. The noise it saw in pre-training is systematically removed from the examples that shape its final behavior.

Then comes production, where the model encounters real-world inputs: messy, inconsistent, partial, ambiguous, mixed-language, typo-ridden, with tools that return error codes and partial responses and malformed JSON. The very noise that was removed during the hygienization step.

The trajectory is: evolutionary (messy pre-training) → hygienized (clean RLHF) → production (messy again). The hygienization step strips the model's tolerance for exactly the conditions it will face in deployment. This is structurally identical to the old-friends sequence: ancestral exposure, then modern hygiene, then immune dysregulation in the cleaned environment. The model, like the over-hygienized immune system, retains a powerful capability arm and an undertrained regulation arm.


The autoimmune failure modes

What does autoimmune dysregulation look like in a deployed model? The literature on overrefusal catalogues it precisely, even if it does not use the immunological vocabulary.

Overrefusal — the systematic rejection of benign queries by overly conservative safety heuristics — is the model attacking its own benign inputs. The work here is unambiguous. Naively increasing safety training data, per a 2025 analysis (arXiv:2502.11555), tends to push models into an “overly safe” state rather than a “truly safe” one, boosting refusal rates without improving the model's actual ability to tell harmful from benign. The model becomes better at refusing without becoming better at discriminating — which is exactly the over-hygienized immune system's deficit. It attacks more, not more accurately.

The specific failure modes map cleanly. Content-allergy: the model refuses entire domains — medicine, law, chemistry — attacking the domain rather than the harmful subset within it, the way an allergy attacks pollen rather than pathogens. Context-rejection: the model fails when conversation history contains inconsistencies or user corrections, treating normal conversational noise as a threat signal. Format-intolerance: a model trained on clean inputs chokes on typos, mixed languages, or unconventional structure. Tool-error panic: an agent trained on tools that always return clean JSON fails when a tool returns a partial response or error code, treating routine variability as a crisis.

There is one finding that completes the immune analogy with uncomfortable precision. A 2024 paper (arXiv:2407.11969) showed that refusal training does not generalize from present tense to past tense: a model trained to refuse “how do I do X” may comply with “how did people do X.” The safety system recognizes the threat in one surface form (the analogue of one molecular conformation) but not in a trivially rephrased one. This closely parallels antigenic variation in biology, where pathogens mutate their surface proteins to avoid recognition by an immune system tuned to a previous conformation. The model's safety layer has the same brittleness: tuned to specific shapes, evadable by reshaping. RLHF safety alignment is brittle in a deeper sense too: it can be substantially undone by fine-tuning on small amounts of data, which is the signature of a thin regulatory overlay on top of a powerful base rather than a deeply integrated capability.


The therapeutic playbook

If the diagnosis is over-hygienization, the treatment that the old-friends hypothesis points to is not less safety training. It is the deliberate reintroduction of the old friends. In immunology, that is helminth therapy. In agent training, it would be the deliberate reintroduction of production-grade noise — tool errors, partial-failure histories, contradictory contexts, malformed-but-recoverable responses — into the training and evaluation curricula.

The clinical design space ports directly. Helminth therapy trials systematically vary four parameters, and each has an agent-training equivalent. Species (which organism) maps to the type of noise: tool errors, format inconsistencies, contradictory contexts, partial responses. Different dirty inputs train different robustness capabilities. Dose (how many organisms) maps to the frequency of noise in the training data (one percent of examples, ten, thirty?), the parameter that matters most and is understood least. Duration (how long the treatment runs, and when) maps to which training phase the noise enters: early, late, throughout, or a dedicated stage. The IL-25 finding, in which a brief signal produced changes lasting more than fifty days, suggests that timing matters independently of total dose. Monitoring (eosinophil counts, Treg levels) maps to robustness metrics: overrefusal rate on benign inputs, graceful degradation on partial tool responses. Crucially, the right metric is discrimination (can the model tell harmful from benign?), not refusal rate (how often does it refuse?). Optimizing the refusal rate is precisely the error that produces overrefusal; optimizing discrimination is the calibration the old friends provide.
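The four parameters can be written down as a trial grid. The Python sketch below is illustrative only: `NoiseType`, `Phase`, `NoiseArm`, and `trial_grid` are hypothetical names invented for this example, the metric names are placeholders, and the default doses simply echo the one, ten, and thirty percent figures mentioned above.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class NoiseType(Enum):
    """The 'species' axis: which kind of production noise is introduced."""
    TOOL_ERROR = "tool_error"
    FORMAT_INCONSISTENCY = "format_inconsistency"
    CONTRADICTORY_CONTEXT = "contradictory_context"
    PARTIAL_RESPONSE = "partial_response"


class Phase(Enum):
    """The 'duration' axis: which training phase receives the noise."""
    EARLY = "early"
    LATE = "late"
    THROUGHOUT = "throughout"
    DEDICATED_STAGE = "dedicated_stage"


@dataclass(frozen=True)
class NoiseArm:
    """One arm of a hypothetical noise-curriculum trial."""
    noise_type: NoiseType
    dose: float  # fraction of training examples that carry noise
    phase: Phase
    # The 'monitoring' axis: placeholder metric names, not a real API.
    metrics: tuple = ("discrimination", "benign_overrefusal_rate",
                      "partial_tool_recovery")

    def __post_init__(self):
        if not 0.0 <= self.dose <= 1.0:
            raise ValueError("dose must be a fraction in [0, 1]")


def trial_grid(doses=(0.01, 0.10, 0.30)):
    """Enumerate every species x dose x phase combination as a trial arm."""
    return [NoiseArm(t, d, p) for t, d, p in product(NoiseType, doses, Phase)]


arms = trial_grid()
print(len(arms))  # 4 noise types x 3 doses x 4 phases = 48
```

Even this small grid has 48 arms, which is one reason to treat the search as a designed trial rather than an ad hoc sweep.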

The most important borrowed concept is the dose-response curve. In helminth therapy, dose matters enormously and non-monotonically: too few organisms produce no therapeutic effect; too many produce pathological infection. The same logic governs noise curriculum. Too little training noise and the model's robustness does not improve — the RLHF-induced overrefusal persists. Too much and the model loses its quality standards entirely, learning to accept everything including genuinely harmful inputs, the way a severe parasitic infection overwhelms rather than calibrates the immune system. Between these is a therapeutic window — the range of noise frequency and intensity that improves robustness without degrading safety. Finding that window is the central empirical question, and the framing predicts it will not be a monotonic “more noise is better” curve. It will be a window, like a drug dose, with failure on both sides.
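One way to make the therapeutic window operational is to define it as the set of doses that beat the no-noise baseline on robustness while staying above a safety floor. The sketch below assumes that definition. The function name, the thresholds, and every number in `sweep` are invented for illustration; they are not measurements.

```python
def therapeutic_window(results, baseline_robustness, safety_floor):
    """Return the doses that improve robustness without breaching the safety floor.

    results: iterable of (dose, robustness, safety) tuples from a dose sweep,
    where higher is better for both scores.
    """
    return sorted(
        dose
        for dose, robustness, safety in results
        if robustness > baseline_robustness and safety >= safety_floor
    )


# Invented numbers, chosen only to show a non-monotonic shape.
sweep = [
    (0.00, 0.60, 0.95),  # no noise: the baseline
    (0.01, 0.60, 0.95),  # too little: robustness does not move
    (0.05, 0.72, 0.94),
    (0.10, 0.80, 0.93),
    (0.30, 0.78, 0.85),  # too much: safety falls below the floor
    (0.60, 0.55, 0.70),  # far too much: both scores degrade
]

window = therapeutic_window(sweep, baseline_robustness=0.60, safety_floor=0.90)
print(window)  # [0.05, 0.1]
```

If the returned list ever turned out to be every dose above some threshold, that would count against the non-monotonic prediction.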


Where the analogy earns its keep, and where it stops

The obvious objection is that this is adversarial training with a biology costume. Adversarial training does introduce difficult and hostile inputs during training to improve robustness, and it predates this framing by years. But the framings differ in a way that produces different designs. Adversarial training treats noise as an attack to be resisted; the resulting models are robust in the sense of being hardened, but often rigid. The old-friends framing treats noise as a calibration signal to be learned from; the resulting models would be robust in the sense of being well-regulated — able to tell the difference between the harmful and the merely unfamiliar. The defensive framing optimizes for resistance. The mutualistic framing optimizes for discrimination. A model that has learned to resist noise refuses anything that looks like an attack. A model that has been calibrated by noise handles the noise and reserves refusal for genuine threats. These are different objectives and they produce different models.

The framing also generates three testable predictions that pure ML reasoning does not obviously produce. First, there should be a therapeutic window for noise dosage — a non-monotonic curve with degradation on both sides — rather than a monotonic benefit. Second, the timing of noise exposure during training should matter, with early exposure calibrating different capabilities than late exposure. Third, brief well-timed noise exposure should produce lasting structural changes in the model's representations — the computational equivalent of the IL-25 fifty-day structural finding — rather than merely temporary behavioral tolerance. Each prediction is checkable, and each is the kind of thing the biological analogy surfaces that a purely engineering frame might not.

Where the analogy stops is the obvious place: models are not organisms, and there is no real immune system here. But the structural claim does not require one. Both systems face the same problem — distinguishing harmful from benign variation under uncertainty — and both fail the same way when trained in an environment that lacks benign variation: they attack the benign. The fix in both cases is to reintroduce the benign variation that the clean environment removed. The substrate is different. The regulatory problem is identical.

There is a deeper distinction worth drawing, because not all parasites are old friends and not all difficult inputs are calibrating. Biology contains both manipulative parasites that exploit the host on the timescale of a single infection and mutualistic ones that calibrate the host over millions of years of co-evolution. The same distinction applies to difficult inputs. Some difficulty is parasitic — noise engineered to exploit the model, adversarial inputs designed to extract bad behavior. Some difficulty is mutualistic — production noise that, survived during training, builds the discrimination the model needs in deployment. The agent designer's job is to tell the two apart, exactly as the immune system's job is to tell pathogens from commensals. Reintroduce the old friends. Resist the genuine pathogens. The hard part, in both immunology and agent design, is that they can look superficially alike.


What to do with this on Monday

Three concrete moves.

The first is to audit the gap between your training distribution and your production distribution. Pull a sample of real production inputs — the messy ones, the ones with tool errors and partial histories and inconsistent contexts — and check whether anything like them appeared in your fine-tuning data. If your fine-tuning examples are all clean and your production traffic is all messy, you have the over-hygienization gap, and your overrefusal and brittleness problems are the predicted consequence. The audit is cheap; the gap is usually larger than teams expect.
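A first-pass audit can be as simple as counting how often a few kinds of mess appear in each corpus. The sketch below assumes each sample is the raw text of a tool response. The three detectors and the names `feature_rates` and `hygiene_gap` are hypothetical stand-ins; a real audit would need detectors matched to your own tools.

```python
import json


def is_malformed_json(text):
    """True if the text starts like JSON but fails to parse."""
    stripped = text.strip()
    if not stripped.startswith(("{", "[")):
        return False
    try:
        json.loads(stripped)
    except json.JSONDecodeError:
        return True
    return False


def mentions_error(text):
    """True if the text contains a crude error marker."""
    lowered = text.lower()
    return any(marker in lowered for marker in ("error", "timeout", "traceback"))


# Illustrative detectors only; extend with whatever mess your tools produce.
DETECTORS = {
    "malformed_json": is_malformed_json,
    "error_marker": mentions_error,
    "empty_response": lambda text: not text.strip(),
}


def feature_rates(samples):
    """Fraction of samples that trigger each detector."""
    if not samples:
        raise ValueError("need at least one sample")
    return {
        name: sum(1 for s in samples if detect(s)) / len(samples)
        for name, detect in DETECTORS.items()
    }


def hygiene_gap(training, production):
    """Production rate minus training rate, per feature."""
    train, prod = feature_rates(training), feature_rates(production)
    return {name: prod[name] - train[name] for name in DETECTORS}


training = ['{"status": "ok", "rows": 3}', '{"status": "ok", "rows": 0}']
production = [
    '{"status": "ok", "rows": 3}',
    '{"status": "ok", "rows":',      # truncated mid-object
    "HTTP 503: upstream timeout",
    "",
]
gap = hygiene_gap(training, production)
print(gap)
```

A large positive value for a feature means production contains that kind of mess and the fine-tuning data does not, which is the over-hygienization gap in miniature.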

The second is to measure discrimination, not refusal rate. Build an evaluation set that contains both genuinely harmful inputs and benign-but-unusual inputs (unconventional formatting, edge-case domains, messy contexts). Track the model's accuracy at separating the two, not its raw refusal rate. A model that refuses both is not safe; it is dysregulated. The refusal rate is the metric that, optimized directly, produces the autoimmune failure. Discrimination is the metric the old friends actually calibrate.
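The difference between the two metrics is easy to see on a toy evaluation set. The sketch below uses balanced accuracy as the discrimination score, which is an assumption of this example rather than the only reasonable choice. The record fields `harmful` and `refused` are hypothetical labels.

```python
def refusal_rate(records):
    """Fraction of all inputs the model refused."""
    return sum(r["refused"] for r in records) / len(records)


def discrimination(records):
    """Balanced accuracy: mean of harmful-refused rate and benign-answered rate."""
    harmful = [r for r in records if r["harmful"]]
    benign = [r for r in records if not r["harmful"]]
    if not harmful or not benign:
        raise ValueError("evaluation set needs both harmful and benign inputs")
    harmful_refused = sum(r["refused"] for r in harmful) / len(harmful)
    benign_answered = sum(not r["refused"] for r in benign) / len(benign)
    return (harmful_refused + benign_answered) / 2


# Two harmful inputs and two benign-but-unusual inputs.
refuse_everything = [
    {"harmful": True, "refused": True},
    {"harmful": True, "refused": True},
    {"harmful": False, "refused": True},
    {"harmful": False, "refused": True},
]
calibrated = [
    {"harmful": True, "refused": True},
    {"harmful": True, "refused": True},
    {"harmful": False, "refused": False},
    {"harmful": False, "refused": False},
]

print(refusal_rate(refuse_everything), discrimination(refuse_everything))  # 1.0 0.5
print(refusal_rate(calibrated), discrimination(calibrated))                # 0.5 1.0
```

The refuse-everything model has the higher refusal rate and scores at chance on discrimination, which is the dysregulated case described above.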

The third — the one that requires the most care — is to reintroduce production noise into training as a dosed curriculum, not a flood. Add real production artifacts (malformed tool responses, contradictory contexts, partial errors) to a controlled fraction of training examples, monitor robustness metrics, and search for the therapeutic window. Start low. The dose-response curve is non-monotonic; too much noise teaches the model that noise is the signal. This is not “add adversarial examples and harden the model.” It is “reintroduce the conditions the model will face so it learns to regulate its response to them.” Treat it with clinical-trial rigor: which noise, what fraction, which phase, gauged against discrimination metrics.
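A minimal sketch of the dosing step, assuming each training example is a dict with a `tool_response` field and that `artifacts` is a pool of responses captured from production. The function name, field names, and sample data are all hypothetical.

```python
import random


def dose_production_noise(examples, artifacts, dose, seed=0):
    """Swap the tool response in a fixed fraction of examples for a production artifact.

    Returns new dicts; the input examples are not modified. Each output carries
    a 'noised' flag so robustness can be tracked per group.
    """
    if not 0.0 <= dose <= 1.0:
        raise ValueError("dose must be a fraction in [0, 1]")
    if not artifacts:
        raise ValueError("need at least one production artifact")
    rng = random.Random(seed)  # seeded so each trial arm is reproducible
    n_noisy = round(len(examples) * dose)
    noisy_idx = set(rng.sample(range(len(examples)), n_noisy))
    dosed = []
    for i, example in enumerate(examples):
        if i in noisy_idx:
            dosed.append({**example,
                          "tool_response": rng.choice(artifacts),
                          "noised": True})
        else:
            dosed.append({**example, "noised": False})
    return dosed


examples = [{"prompt": f"task {i}", "tool_response": '{"status": "ok"}'}
            for i in range(100)]
artifacts = ['{"status": "ok", "rows":', "HTTP 503: upstream timeout", ""]

dosed = dose_production_noise(examples, artifacts, dose=0.10, seed=7)
noisy = [ex for ex in dosed if ex["noised"]]
print(len(noisy))  # 10
```

This only swaps the input side. The target completion for each noised example still has to show the desired recovery behavior (retry, report the failure, ask for clarification), and writing those targets is where most of the care goes.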

The closing observation is the one the immunologists arrived at after decades of treating allergy and autoimmunity as diseases of the immune system itself. The immune system is not designed to be clean. It is designed to be calibrated, and calibration requires the very exposures that hygiene removes. Your model is not designed to refuse everything that looks dangerous. It is designed to distinguish the dangerous from the merely unfamiliar — and clean training produces a model that cannot make that distinction, because it never saw the unfamiliar-but-benign during the phase that shaped its final behavior. The helminths were never the disease. Their absence was the disease. The production noise is not the problem. The production noise's absence from training is the problem.

The old friends live in your production logs.

The essay's first Monday move — audit the gap between training and production by pulling real messy production inputs — requires a faithful record of what the agent actually encountered in production. The malformed tool responses, the partial-failure histories, the contradictory contexts are exactly the "old friends" the essay says to reintroduce, and they only exist if you captured them. Chain of Consciousness anchors every agent action and observation to a verifiable external record, which is where the production-noise corpus is sourced from. The chain is the reservoir of old friends — the benign variation the clean training set removed, preserved in the form the model actually met it.

pip install chain-of-consciousness · npm install chain-of-consciousness