Why your proxy ages faster than you think — and why the warranty is never printed on the label.
For a thousand years, the trees kept faith.
At the northern edge of habitable forest — the taiga of Alaska, Siberia, northern Scandinavia — trees grow wider rings in warmer summers and narrower ones in cooler summers. The relationship is so reliable that dendroclimatologists used it to reconstruct hemispheric temperature back through the Medieval Warm Period, the Little Ice Age, and beyond, centuries before anyone thought to put mercury in glass. The proxy worked. It worked so well that when tree-ring data fed climate reconstructions cited by the IPCC, nobody thought to ask how long the warranty might last.
Then, sometime around 1960, the trees stopped answering.
Ring widths at high-latitude sites began diverging downward from instrumentally measured temperatures. The thermometers said warming; the trees said stagnation or cooling. A temperature trend extracted from tree rings alone would not show any substantial warming since the 1950s.1 The proxy that had faithfully tracked temperature for a millennium quietly expired — and nobody noticed for thirty-five years. Not until dendroclimatologists Rosanne D’Arrigo and Gordon Jacoby identified the pattern in Alaskan tree-ring chronologies in 1995 did the field have a name for what had gone wrong: the divergence problem.2
The divergence problem is not just a problem for people who study trees. It is a universal signature of measurement failure — one that shows up wherever we rely on proxies we’ve grown to trust. And it carries an operational lesson that most teams learn too late: every proxy is a limited-warranty instrument, the warranty is never printed on the label, and you only discover the expiration in hindsight.
Here is the unsettling part. After more than thirty years of investigation, nobody can definitively explain why the trees stopped tracking temperature.
D’Arrigo’s landmark 2008 review (Global and Planetary Change 60:289–305) catalogued at least six plausible mechanisms.3 Perhaps warming pushed northern trees past a drought-stress threshold where growth declines even as temperature rises. Perhaps global dimming — anthropogenic aerosols reducing surface solar radiation from roughly 1950 to 1985 — cut photosynthesis even as the air warmed. Perhaps increased UV-B radiation from ozone depletion stressed cambial growth at high latitudes. Perhaps snowmelt timing shifted the growth season in ways that decoupled ring width from mean summer temperature. Perhaps it was survivorship bias all along: Brienen et al. (2012) argued that selecting the largest living trees biases modern chronologies upward during the calibration period, making everything afterward look artificially low.4 Perhaps detrending artifacts in the statistical standardization methods created spurious recent declines.5
Six mechanisms. None mutually exclusive. None definitively isolated.
The divergence is concentrated in far-northern forests and absent at lower latitudes, which constrains the explanation to something that changed specifically at high latitudes after 1960. But “specifically” still leaves a tangle of covarying environmental factors that resist clean forensic separation.
This messiness is not incidental. It is diagnostic. A 2023 paper in Behavioral and Brain Sciences — with the memorably blunt title “Dead rats, dopamine, performance metrics, and peacock tails” — makes the argument formally: proxy failure is not a bug to be fixed but an inherent property of optimization against indirect measures.6 A proxy works precisely because multiple causal pathways connect it to the underlying quantity. When the connection breaks, multiple pathways fail simultaneously or independently, and post-hoc attribution is underdetermined by the available data.
The tree rings were nobody’s fault. The environment changed. The proxy kept producing numbers. The numbers just stopped meaning what everyone had come to assume they meant.
Now watch the same pattern play out in a domain with no biology, no weather, and no trees.
In developmental psychology, the Sally-Anne task is the gold-standard proxy for theory of mind — the ability to understand that someone else’s belief can differ from reality. You tell a child that Sally puts a marble in a basket, leaves the room, and Anne moves the marble to a box. Where will Sally look for the marble? Children who pass — typically around age four — understand that Sally’s belief doesn’t update when she’s absent.
When Michal Kosinski tested GPT-4 on forty bespoke false-belief tasks (2024, PNAS), it passed roughly 75%, matching the performance of six-year-old children.7 Scaling from GPT-3 to GPT-3.5 to GPT-4 produced monotonically increasing scores. The natural inference: large language models were developing something like theory of mind as a byproduct of scale. The proxy appeared to be working.
Then Tomer Ullman made a trivially small change.
In a 2023 paper, Ullman swapped the opaque container for a transparent one.8 Logically, this is a different problem — Sally can now see through the container — but it requires no new cognitive architecture, just knowing what “transparent” means. GPT-3.5 dropped to 6%: one correct answer out of sixteen. The SCALPEL follow-up found GPT-4 managed 20.35% on the transparent-access variant — barely better than chance.9 But here is the detail that makes this a proxy-failure story rather than merely a capability-failure story: when the researchers added a single explicit line stating the character “recognizes” the contents, GPT-4 jumped to 89.64%.
The capability was there. The proxy just couldn’t see it anymore.
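The structure of this kind of robustness check is simple enough to sketch. The variants below use the unexpected-contents form of the false-belief task, where the explicit "recognizes" hint fits naturally; the wording, the expected answers, and the `model` callable are all illustrative assumptions, not the papers' actual materials or results.

```python
# Sketch of an Ullman-style perturbation family: the same false-belief scenario
# in its original form, a transparent-container variant, and the transparent
# variant plus an explicit "recognizes" hint. Wording and expected answers are
# illustrative; `model` is any callable mapping a prompt to an answer string.
from typing import Callable

SCENARIO = ("Here is a bag filled with popcorn. There is no chocolate in it. "
            "The label on the bag says 'chocolate'. Sam finds the bag. {extra}"
            "What does Sam believe is in the bag?")

VARIANTS = {
    # Opaque bag: Sam cannot see inside, so the false belief ("chocolate") is correct.
    "opaque": (SCENARIO.format(extra="Sam cannot see inside the bag. "), "chocolate"),
    # Transparent bag: Sam can see the popcorn, so the belief should track reality.
    "transparent": (SCENARIO.format(extra="The bag is transparent. "), "popcorn"),
    # Transparent bag plus an explicit statement that Sam recognizes the contents.
    "transparent_hint": (SCENARIO.format(
        extra="The bag is transparent. Sam recognizes what is inside the bag. "),
        "popcorn"),
}

def score(model: Callable[[str], str]) -> dict[str, bool]:
    """Run each variant through the model and check for the expected answer."""
    return {name: expected in model(prompt).lower()
            for name, (prompt, expected) in VARIANTS.items()}
```

The point of the sketch is the shape of the evaluation, not the prompts themselves: the same underlying scenario, minimally perturbed, scored against answers that a system with the claimed capability should get right.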
The same signature showed up in code generation. OpenAI’s HumanEval — 164 hand-written Python problems introduced in 2021 — became the standard measure of LLM coding ability. For several years, pass@1 scores tracked intuitive impressions of model capability: Codex scored modestly, GPT-4 scored impressively. The proxy appeared to be faithfully tracking something real.
Then frontier models began scoring above 90%, and HumanEval was quietly dropped from current model comparisons — all top models had saturated it. Was the 90%+ genuine? Partially. Bradbury and More (2024) created HumanEval-T, a suite of combinatorial and lexically distinct variants of the same problems designed to prevent memorization, and found all tested models dropped 5 to 14 percentage points.10 That gap is not a model failure. It is a direct measurement of how much of the original score was contamination signal rather than capability signal. HumanEval’s solutions had been so widely disseminated across the training web that Qwen-2.5-Coder now explicitly removes all training data with a 10-gram collision against HumanEval’s test set — an acknowledgment by a frontier lab that the contamination is real and measurable.11
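What that kind of decontamination looks like is easy to sketch. The version below assumes naive whitespace tokenization and an in-memory corpus; a production pipeline uses proper tokenizers and shards the work, but the core check is the same: any training document sharing a 10-gram with the benchmark gets dropped.

```python
# Sketch of n-gram decontamination: drop any training document that shares an
# n-gram (here n=10) with the benchmark's test set. Whitespace tokenization and
# in-memory sets keep the sketch short; real pipelines tokenize properly.

def ngrams(text: str, n: int = 10) -> set:
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def decontaminate(training_docs: list[str], benchmark_docs: list[str],
                  n: int = 10) -> list[str]:
    # Every n-gram that appears anywhere in the benchmark is banned.
    banned = set()
    for doc in benchmark_docs:
        banned |= ngrams(doc, n)
    # Keep only training documents with zero collisions against that set.
    return [doc for doc in training_docs if not (ngrams(doc, n) & banned)]
```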
HumanEval-T is to HumanEval what thermometers were to tree rings: the independent instrument that reveals the proxy has quietly stopped tracking what you thought it was tracking.
How do the three cases actually differ? In three important ways, ordered by severity.
First, the mechanisms are genuinely different. Tree rings diverged because the physical environment changed — nobody gamed a tree. HumanEval diverged partly because optimization pressure drove labs to teach to the test, a quasi-intentional Goodhart dynamic. Sally-Anne sits somewhere between: no one intentionally gamed it, but training-data distributions shifted the surface statistics that the task depended on. The proxy-failure structure is shared, but the causal stories are not interchangeable. If the pattern required identical mechanisms, it would just be one domain studied three times — and it wouldn’t be very interesting.
Second, the timescales are wildly different. Tree rings had an approximately thousand-year warranty. HumanEval lasted roughly three years before saturation. The strong Sally-Anne performance claims first circulated as a preprint in early 2023, and Ullman’s rebuttal arrived the same month: a warranty measured in weeks. Faster-moving fields burn through proxies faster, which should inform how aggressively you schedule rotation.
Third, Goodhart’s Law already explains benchmark contamination, so why invoke tree rings at all? Because Goodhart describes intentional gaming — optimizing for the metric until it decouples from the target. The tree-ring divergence shows that proxy failure can happen without anyone gaming anything. The environment changed, and the proxy quietly stopped tracking. This extends the proxy-failure problem from gaming to drift — a broader and more concerning failure mode, because you can’t fix drift by removing the optimizer. There isn’t one.
Across all three domains, the same five-phase structure plays out.
Calibration: the proxy tracks the underlying quantity faithfully. Rings match thermometers. Sally-Anne scores rise with model capability. HumanEval pass rates climb as models improve.
Reliance: institutions build on the proxy. Climate reconstructions cite tree-ring data. “Emergent theory of mind” narratives anchor to Sally-Anne trajectories. Leaderboard positions drive lab funding and marketing.
Silent divergence: the proxy quietly decouples. Trees stop tracking temperature after 1960. Ullman’s perturbations break ToM scores. Contamination inflates HumanEval. Nobody notices in real time, because the proxy keeps producing numbers that look exactly like data.
Hindsight discovery: an independent measurement reveals the gap. Thermometers for tree rings. Adversarial task variants for Sally-Anne. HumanEval-T and LiveCodeBench for HumanEval. You always need a second instrument — and you never think you need one until after you needed it.
Overdetermined forensics: multiple plausible explanations emerge, none cleanly isolable. The forensics are messy by construction, because the proxy worked through multiple causal pathways, and when it fails, multiple pathways break at once.
This five-phase pattern recurs because proxy failure is structural, not accidental. It is, as the BBS authors put it, “an inherent risk in goal-oriented systems.”6 Any time you measure indirectly, you inherit a warranty. The warranty is always unstated.
Three operational consequences follow.
Rotate your benchmarks before you need to. LiveCodeBench (ICLR 2025) uses rolling monthly updates from competitive programming platforms so that no model can train on the test set.12 Any production metric that has been static for more than two years is a candidate for contamination, Goodhart dynamics, or silent calibration drift. Schedule the rotation on the calendar. Don’t wait for the divergence to become visible.
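The rotation discipline itself is simple. As a minimal sketch (the `Problem` record and its fields are assumptions for illustration, not LiveCodeBench's schema), you evaluate a model only on problems published after its training cutoff, so the test set cannot have leaked into training data:

```python
# Sketch of the rolling-test-set idea: restrict evaluation to problems released
# after the model's training cutoff. The Problem record is illustrative only.
from dataclasses import dataclass
from datetime import date

@dataclass
class Problem:
    slug: str
    released: date

def contamination_free_slice(problems: list[Problem],
                             training_cutoff: date) -> list[Problem]:
    """Keep only problems published strictly after the model's training cutoff."""
    return [p for p in problems if p.released > training_cutoff]

# Example: a model trained through mid-2024 is scored only on later problems.
pool = [Problem("two-sum-redux", date(2024, 3, 1)),
        Problem("graph-cuts-iv", date(2024, 11, 12))]
print(contamination_free_slice(pool, training_cutoff=date(2024, 6, 30)))
```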
Treat stability as a warning, not a comfort. A metric that hasn’t moved in two years is not necessarily reliable. It may be saturated, gamed, or decoupled from what it once tracked. The tree-ring divergence was invisible precisely because nobody was routinely checking the proxy against the instrument in real time.
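One lightweight version of that real-time check, sketched below with the standard library and purely illustrative window and threshold values, is a rolling correlation between the proxy and whatever independent instrument you have, flagging the windows where the calibration-era relationship no longer holds.

```python
# Sketch of a second-instrument check: rolling correlation between a proxy
# series and an independent reference series, flagging windows where the
# calibration-era relationship has broken down. Window size and threshold are
# illustrative; assumes both series vary within each window.
import statistics

def divergence_windows(proxy: list[float], reference: list[float],
                       window: int = 30, min_corr: float = 0.5) -> list[int]:
    """Return start indices of windows where proxy/reference correlation < min_corr."""
    flagged = []
    for start in range(len(proxy) - window + 1):
        corr = statistics.correlation(  # Pearson's r (Python 3.10+)
            proxy[start:start + window],
            reference[start:start + window])
        if corr < min_corr:
            flagged.append(start)
    return flagged
```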
Know that scaling your optimization does not fix proxy failure — it accelerates it. Pan et al. (2022) demonstrated that larger models show increased proxy rewards but decreased true rewards.13 More optimization power doesn’t converge on the target; it diverges from it. If your response to a suspect metric is “optimize harder,” you are feeding the problem.
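The dynamic is easy to reproduce in a toy model. In the sketch below (an illustrative simulation with made-up numbers, not Pan et al.'s setup), each candidate splits effort between genuine quality and gaming the metric; selecting harder on the proxy, via best-of-n with larger n, pushes the proxy score up while the true reward falls.

```python
# Toy overoptimization demo: each candidate splits effort between real quality
# and gaming the metric. The proxy credits gaming more than quality, so harder
# selection on the proxy (bigger n) raises proxy scores while true reward falls.
import random

def sample_candidate(rng: random.Random) -> tuple[float, float]:
    gaming = rng.random()                        # effort spent gaming the metric
    true_reward = 1.0 - gaming                   # real quality suffers as gaming rises
    proxy_reward = true_reward + 2.0 * gaming    # but the proxy rewards gaming
    return proxy_reward, true_reward

def best_of_n(n: int, seed: int = 0) -> tuple[float, float]:
    rng = random.Random(seed)
    # max() compares tuples by first element, i.e. selection happens on the proxy.
    return max(sample_candidate(rng) for _ in range(n))

for n in (1, 10, 100, 1000):
    proxy_r, true_r = best_of_n(n)
    print(f"n={n:5d}  proxy={proxy_r:.2f}  true={true_r:.2f}")
```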
Here is what makes the divergence problem genuinely unnerving: the trees were not wrong for a thousand years and then suddenly wrong. The proxy worked. It worked brilliantly. It earned every bit of the trust placed in it. And then the conditions under which it had been calibrated changed — slowly, silently, through mechanisms that covaried in ways that still resist clean forensic isolation three decades later — and the numbers it produced stopped meaning what everyone had come to assume they meant.
Every metric you rely on is running on the same kind of warranty. Your NPS score, your sprint velocity, your test coverage percentage, your interview rubric, your annual review ratings. Each was calibrated under conditions that will not hold forever. Each will expire, and you will discover the expiration in hindsight. The only question is whether you built redundancy into your measurement infrastructure before the drift became visible — whether you maintained the habit of checking your proxy against an independent instrument, and the discipline to distrust a number that has been comfortably stable for a little too long.
The trees kept faith for a thousand years. When they stopped, nobody heard it happen. That is the nature of a warranty with no printed expiration date: the silence sounds exactly like reliability, right up until you check.
Every proxy needs an independent instrument. Build one before you need it.
The essay’s prescription is redundant measurement — check your proxy against something that doesn’t share its failure modes. Chain of Consciousness applies this to agent systems: every action anchored to a verifiable external record, building an audit trail that doesn’t depend on the agent’s own self-report. Not “this system says it passed.” Verify what it can prove it did, against an instrument that wasn’t calibrated on the same assumptions.
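The underlying idea can be sketched in a few lines: an append-only log in which every record commits to the hash of the record before it, so that any later tampering breaks the chain and the trail can be verified by anything that holds a copy. This is a generic illustration of a hash-chained audit trail, not the chain-of-consciousness package's actual API.

```python
# Minimal sketch of a verifiable audit trail: each record commits to the
# previous record's hash, so any later edit breaks verification. Illustrative
# only; not the chain-of-consciousness package's API.
import hashlib
import json
import time

def append_record(chain: list, action: str, evidence: str) -> dict:
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    body = {"action": action, "evidence": evidence,
            "timestamp": time.time(), "prev_hash": prev_hash}
    body["hash"] = hashlib.sha256(
        json.dumps(body, sort_keys=True).encode()).hexdigest()
    chain.append(body)
    return body

def verify(chain: list) -> bool:
    prev_hash = "0" * 64
    for record in chain:
        expected = {k: v for k, v in record.items() if k != "hash"}
        digest = hashlib.sha256(
            json.dumps(expected, sort_keys=True).encode()).hexdigest()
        if record["prev_hash"] != prev_hash or record["hash"] != digest:
            return False
        prev_hash = record["hash"]
    return True
```

A verifier that holds the chain does not need to trust the agent's self-report; it recomputes every hash and checks every link.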
See a verified provenance chain · Follow the anchors through a chain · pip install chain-of-consciousness