A Roman denarius reads 95% silver on the surface and 35% in the core. The same gap now defines AI evaluation.
In 2014, a team using a Bruker M4 TORNADO micro-XRF spectrometer pointed the beam at a Roman denarius and read the surface at 95% silver. A healthy coin—close to the purity of Rome’s Republican glory days, the kind a centurion might have weighed in his palm with satisfaction. Then they polished a small section of the coin’s edge and scanned again.
The core read 35%.
That sixty-percentage-point gap between surface and substance changed how we understand Roman economic history. For generations, scholars had charted the debasement of the denarius as a smooth, roughly linear decline—from the high-purity silver of the Republic through the gradual degradation of the Empire to the copper-washed tokens of the third-century crisis. The narrative was tidy. The curves were clean. And the evidence was systematically wrong, because every data point was a surface reading.
Two distinct processes conspire to make a debased coin read as genuine.
The first is natural. When a silver-copper alloy sits in soil for centuries, copper—the more chemically reactive metal—migrates outward and leaches into the surrounding earth. Silver, nobler and more stable, stays put. The surface slowly concentrates in silver while the core remains copper-rich. An archaeologist running surface-only XRF on a burial find is measuring centuries of selective corrosion, not the coin’s original composition.
The second process was intentional, and far more revealing. Roman mints under fiscal pressure soaked freshly struck coin blanks in dilute vinegar, dissolving the copper from the outer layer to leave a thin shell of nearly pure silver over a debased core. As Manukyan et al. documented in their 2019 Applied Surface Science study of the surface and interior composition of Roman denarii, this selective copper leaching was the primary method used during the second and third centuries CE. Its purpose was explicit: keeping the public unaware of the debasement.
Both mechanisms produce the same misleading signal. A copper-rich coin that reads as silver-rich. The natural version might fool an archaeologist two millennia later. The intentional version was designed to fool an empire in real time.
Kevin Butcher at the University of Warwick and Matthew Ponting at the University of Liverpool spent years drilling past that surface. Using bulk analysis techniques—inductively coupled plasma spectrometry, scanning electron microscopy, laser ablation mass spectrometry—they built the definitive metallurgical survey of Roman silver coinage, published by Cambridge University Press in 2014. What they found wasn’t a gentle decline. It was a series of lurches, each tied to a specific fiscal crisis.
Republican denarii held above 95% silver for generations. Nero made the first significant cut around AD 64, dropping to roughly 80%. Vespasian held near that range while aggressively recycling older coins—a 2023 paper in Archaeological and Anthropological Sciences documented extensive mixing of lead and silver ores under his reign. Septimius Severus hacked the standard to around 50% by AD 200. And by the time Claudius II was minting around AD 270, the denarius contained 2–4% silver—essentially a bronze coin wearing a silver mask.
The old debasement curve, built on decades of surface-only XRF, was a surface reading of a surface reading. As Bruker’s own application literature puts it: “The debasement of the Roman denarius was not as linearly progressive as had been initially believed.”
In late 2024, Microsoft’s Phi-4 scored roughly 85% on MMLU, the most widely cited AI benchmark in the industry. On SimpleQA—a test specifically designed to resist the kinds of optimization that inflate MMLU scores—it scored 3%.
That is the Bruker moment for AI evaluation. A model that reads 85% silver on the surface is 3% silver in the core. MMLU is the surface XRF scan. SimpleQA is the polished-edge core sample.
The mechanisms map with uncomfortable precision. AI benchmark scores get enriched through the same dual process—natural and intentional—that enriched Roman coins.
Natural contamination works like soil-corrosion enrichment. Models trained on internet-scale data inevitably encounter benchmark questions embedded in blog posts, textbooks, forums, and study guides. The GPT-3 paper flagged “significant contamination across many benchmarks—in some exceeding 90%.” The GPT-4 technical report acknowledged that “portions of BIG-bench were inadvertently mixed into the training set.” Given enough data and enough time, the enrichment is thermodynamic—not malice, just information leaching from a porous environment.
Intentional optimization is the acid bath. Fine-tune on benchmark-adjacent data. Submit your best of twenty-seven private variants to leaderboards while the public gets the baseline—as Meta did with Llama 4, which debuted at rank #2 on Chatbot Arena before the production version dropped to rank 32. Meta’s chief AI scientist Yann LeCun acknowledged the results were “fudged a little bit.” Or fund the creation of the benchmark your model will be tested against while getting exclusive early access to the problems, as OpenAI did with FrontierMath—the equivalent of the Roman mint funding the assay office.
The contamination isn’t subtle. When Hugging Face audited Yi-34B’s MMLU performance, they found a 94% chance of training-data contamination. CausalLM/34b scored 85.6% on MMLU—a result the auditors flagged as “not theoretically possible for a 34-billion-parameter dense model.” When Scale AI created GSM1k, a set of math problems equivalent to the saturated GSM8k but genuinely new, accuracy dropped by up to 13 percentage points, with an r-squared of 0.32 between the size of each model’s drop and its memorization metrics. The surface and the core told different stories every time someone thought to check.
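To make the mechanics concrete, here is a minimal sketch of the kind of verbatim-overlap check contamination audits rest on. The function names and the 13-word window are illustrative assumptions, not the exact methodology of any audit cited above.

```python
# Minimal sketch of an n-gram overlap contamination check (illustrative only).

def ngrams(text: str, n: int = 13) -> set:
    """Word-level n-grams of `text`; a 13-word window is a commonly used choice."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(question: str, corpus_chunks: list, n: int = 13) -> bool:
    """Flag a benchmark item if any of its n-grams appears verbatim in training data."""
    q_grams = ngrams(question, n)
    return bool(q_grams) and any(q_grams & ngrams(chunk, n) for chunk in corpus_chunks)

def contamination_rate(benchmark: list, corpus_chunks: list) -> float:
    """Share of benchmark items with verbatim overlap against the corpus."""
    flagged = sum(is_contaminated(q, corpus_chunks) for q in benchmark)
    return flagged / max(len(benchmark), 1)
```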
The most striking proof that the surface has detached from the core: small models trained exclusively on benchmark questions have outperformed systems orders of magnitude larger. A tin slug dipped in silver passing the assayer’s table. The benchmark measures memory, not intelligence.
And the enrichment is structural—not a fixable bug. Sun et al. tested every major contamination mitigation strategy in their 2025 study and found that none is “both effective and faithful to the evaluation objective.” The surface enrichment isn’t an accident in the process. It is the process. Meanwhile, RLHF and preference optimization create their own version of the problem: models “concentrate on familiar, high-hit-rate patterns, reducing the diversity of reasoning patterns the base model originally had,” as Sebastian Raschka documented in his survey of reasoning-model training. At larger sample sizes, the base model actually produces more diverse correct solutions than the optimized version. The optimization makes the surface shinier while narrowing the core.
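The “larger sample sizes” point is usually made concrete with the standard pass@k estimator: sample n attempts per problem, count the c correct ones. The numbers below are invented for illustration, not taken from the survey: a tuned model that is very reliable on a narrower set of problems wins at k = 1, while a base model that covers more problems, less reliably, catches up and then pulls ahead as k grows.

```python
# pass@k = 1 - C(n-c, k) / C(n, k): chance that at least one of k draws from
# n sampled attempts (c of them correct) solves the problem.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def avg_pass_at_k(per_problem_correct: list, n: int, k: int) -> float:
    return sum(pass_at_k(n, c, k) for c in per_problem_correct) / len(per_problem_correct)

# Hypothetical correct counts out of n=64 samples on five problems.
tuned = [40, 38, 0, 0, 45]   # very reliable, but only on three of the five
base  = [10, 6, 4, 5, 8]     # shaky everywhere, yet covers all five

for k in (1, 8, 64):
    print(f"k={k:>2}  tuned={avg_pass_at_k(tuned, 64, k):.2f}  base={avg_pass_at_k(base, 64, k):.2f}")
```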
Even the attempts to evaluate beyond benchmarks are compromised. When LLMs are used to judge other LLMs—an increasingly common shortcut—they show biases that mirror the surface-enrichment problem. They rate longer responses higher even when the extra length is filler: the thicker the silver wash, the better the coin grades. Claude-v1 shows 25% self-preference bias; GPT-4 favors its own outputs by 10%. The assayer has a stake in the mint.
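Both biases are cheap to probe for. In the sketch below, `judge(prompt, a, b)` is a placeholder for whatever judging call you already make, assumed to return "A" or "B" for the first or second answer it is shown; everything else is illustrative.

```python
# Minimal judge-bias probes. `judge` is a placeholder callable, assumed to
# return "A" or "B" for whichever answer it prefers.

def position_consistent(judge, prompt: str, a: str, b: str) -> bool:
    """A consistent judge picks the same answer regardless of presentation order."""
    return (judge(prompt, a, b) == "A") == (judge(prompt, b, a) == "B")

def rewards_padding(judge, prompt: str, answer: str) -> bool:
    """Pad an identical answer with filler; if the padded copy wins, length is being rewarded."""
    padded = answer + " To elaborate further on the points above," * 5
    return judge(prompt, padded, answer) == "A"
```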
In the 1560s, Sir Thomas Gresham explained to Queen Elizabeth why sound coins kept vanishing from circulation: when two currencies share the same face value but different intrinsic worth, people spend the debased money and hoard the good. Bad money drives out good. But the mechanism has a precondition so fundamental it’s easy to miss: the buyer can’t tell them apart.
The buyer can’t tell them apart because the surface looks the same.
This is formally equivalent to George Akerlof’s market for lemons, as Lester, Postlewaite, and Wright demonstrated in their Minneapolis Fed working paper unifying the two frameworks. Both describe adverse selection under quality uncertainty. When buyers can’t verify quality, sellers of genuinely superior goods can’t command a premium, so they exit the market. Average quality spirals down even as prices hold steady.
This dynamic is now playing out in AI. Benchmark scores are the face value at which all models circulate. A model optimized to ace MMLU and a model with genuine reasoning depth present similar numbers to the enterprise buyer evaluating a procurement decision. The buyer—the merchant accepting these coins—can’t easily distinguish surface from substance. The optimized models are cheaper to build (enriching the surface always costs less than improving the core), so they proliferate. The market incentive shifts from building capability to polishing the display. Average real capability stagnates even as benchmark scores climb.
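The unraveling is easy to see in a toy simulation. The parameters below are invented for illustration, not drawn from any of the cited studies: when buyers can only pay one pooled price because they cannot read the core, the most expensive-to-build, highest-quality systems exit first.

```python
# Toy adverse-selection simulation (illustrative parameters only).
import random

def simulate(rounds: int = 8, n_sellers: int = 200, buyers_can_verify: bool = False) -> list:
    random.seed(0)
    # Each seller's true quality is in [0, 1]; building quality q costs q.
    sellers = [random.random() for _ in range(n_sellers)]
    history = []
    for _ in range(rounds):
        if buyers_can_verify:
            prices = sellers                       # price tracks true quality
        else:
            pooled = sum(sellers) / len(sellers)   # one price for every "85% on MMLU"
            prices = [pooled] * len(sellers)
        # Sellers who can't cover their build cost at the offered price exit.
        sellers = [q for q, p in zip(sellers, prices) if q <= p]
        history.append(round(sum(sellers) / len(sellers), 2))
    return history

print("opaque market:  ", simulate(buyers_can_verify=False))   # average quality ratchets down
print("verified market:", simulate(buyers_can_verify=True))    # average quality holds
```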
Erlei et al. tested this experimentally, publishing their results in January 2026 as “When Life Gives You AI, Will You Turn It Into A Market for Lemons?” Their 330 participants delegated tasks to low-quality AI systems at rates exactly matching the systems’ prevalence—30%, 60%, 90%—demonstrating “no spontaneous quality discrimination.” The surface fooled them completely.
The most sobering finding came next. Even with full disclosure—accuracy scores, data quality indicators, the complete picture—only 57.7% of participants in the high-density condition chose the high-quality system. Even near-perfect information didn’t fix the market. Even after Butcher and Ponting published the correct bulk compositions of Roman denarii, the simplified surface-reading narrative persisted in popular accounts for years. Knowing the surface is misleading doesn’t automatically change how people choose.
But one historical reform proved that the right structural intervention can break the dynamic entirely.
In the late seventh century, Islamic coinage was chaos. Mints across the caliphate struck imitative copies of Byzantine and Sasanian designs with varying weights, uncertain standards, and unreliable value. These were the debased, surface-enriched coins of a monetary system in transition—some copying gold solidus designs, some experimenting with the Caliph’s own portrait, all of uncertain worth.
In 77 AH (696–697 CE), Caliph Abd al-Malik ibn Marwan replaced all of it. His reform introduced a transparent weight standard—the gold dinar at 4.25 grams—with purely epigraphic designs that stripped away the visual noise of imitative coinage. The standard was publicly known. Verification required nothing more than a scale. Any merchant in any souk could test any coin in seconds, no assayer required. The coins were, as the historical record shows, “used without appreciable change for the whole of the Umayyad period, struck to a new and carefully controlled standard.”
The reform worked because it satisfied three conditions: the standard was transparent, verification was simple, and compliance was mandatory—not voluntary best practice, but enforced policy.
AI benchmarking needs its Umayyad reform. Two measures would address most of the surface-enrichment problem.
Public training-data manifests are the transparent weight standard. If providers disclosed what data went into training, evaluators would know which benchmarks remain valid tests and which are contaminated. The information asymmetry that enables the Gresham dynamic—bad models mimicking good ones because buyers can’t distinguish the core—collapses when the composition is declared.
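No standard manifest format exists yet, so the sketch below is an assumption about what a minimal disclosure could contain and how an evaluator would use it; every field name and value is illustrative.

```python
# A hypothetical training-data manifest and the check an evaluator could run
# against it. Field names and values are illustrative assumptions.
MANIFEST = {
    "model": "example-model-v1",
    "data_cutoff": "2025-06-30",
    "sources": [
        {"name": "web-crawl-2025-05", "dedup": "url + fuzzy"},
        {"name": "permissive-code", "dedup": "exact"},
    ],
    "disclosed_benchmark_overlap": ["MMLU", "GSM8k"],
}

def benchmark_still_valid(manifest: dict, benchmark: str, released: str) -> bool:
    """A benchmark remains a fair test if it isn't in the disclosed overlap
    and appeared after the model's data cutoff (ISO dates compare as strings)."""
    if benchmark in manifest["disclosed_benchmark_overlap"]:
        return False
    return released > manifest["data_cutoff"]

print(benchmark_still_valid(MANIFEST, "MMLU", "2020-09-07"))               # False: disclosed overlap
print(benchmark_still_valid(MANIFEST, "LiveBench-2026-02", "2026-02-01"))  # True: post-cutoff
```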
Dynamic, held-out benchmarks are the merchant’s scale—cheap, accessible, and resistant to manipulation. LiveBench, introduced in an ICLR 2025 spotlight paper, generates new questions monthly from current sources, making pre-training contamination impossible by construction; top models still score below 70%, and inter-month score correlations of 0.997 confirm the test measures capability rather than content drift. Humanity’s Last Exam, published in Nature in 2026, assembled 2,500 questions from nearly a thousand domain experts across fifty countries, rejecting any question frontier models could already answer; the best performers sit below 50%. ARC-AGI-2 and Pencil Puzzle Bench test genuine abstraction on novel-by-construction tasks where memorization provides no advantage.
Neither measure is perfect. No assay technique ever is—every method has its detection limits, every scale its tolerance. But together they represent the difference between accepting a surface at face value and testing what lies beneath it.
Until mandatory training-data disclosure arrives and dynamic benchmarks become standard—until the reform—there is a practical discipline worth adopting.
When Bruker’s team wanted the true composition of a Roman denarius, they didn’t accept the surface reading. They polished a small section of the edge and scanned what lay beneath. The surface told them what the coin wanted them to see. The core told them what the coin actually was.
Do the same with any model you’re evaluating. Ignore the leaderboard. Design your own tests—tasks drawn from your actual work, problems no training set has seen, challenges that resist memorization. That’s your polished edge. That’s your core sample.
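In code, the discipline is small. Everything below is a placeholder for your own material: the tasks come from your actual work, the graders encode what correct means for you, and `ask_model` is whatever client you already call.

```python
# A minimal private "core sample" eval. Tasks, graders, and the ask_model
# client are all placeholders for your own; keep the tasks off the internet.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    prompt: str
    passes: Callable[[str], bool]   # grader over the raw model output

PRIVATE_TASKS = [
    Task("Summarize this incident report in three bullet points: ...",
         passes=lambda out: out.count("\n-") + out.count("\n*") >= 2),
    Task("Write the SQL CREATE TABLE statement for this internal schema: ...",
         passes=lambda out: "CREATE TABLE" in out.upper()),
]

def core_sample(ask_model: Callable[[str], str], tasks: list) -> float:
    """Pass rate on held-out tasks the model cannot have memorized."""
    return sum(t.passes(ask_model(t.prompt)) for t in tasks) / len(tasks)
```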
The surface is always optimized to impress you. Trust only the core.
Sources: Manukyan et al., Applied Surface Science (2019); Butcher, K. and Ponting, M., The Metallurgy of Roman Silver Coinage, Cambridge University Press, 2014; Brown et al., “Language Models are Few-Shot Learners” (GPT-3), 2020; OpenAI, “GPT-4 Technical Report,” 2023; Sun et al., “The Emperor’s New Clothes in Benchmarking?” ICML 2025; Raschka, S., “The State of Reinforcement Learning for LLM Reasoning,” 2025; Erlei et al., “When Life Gives You AI, Will You Turn It Into A Market for Lemons?” CHI 2026; White et al., “LiveBench,” ICLR 2025; Phan et al., “Humanity’s Last Exam,” Nature, 2026; Lester, Postlewaite, and Wright, “Gresham’s Law in a Lemons Market,” Minneapolis Fed Working Paper.
The surface is optimized to impress. Build the instrument that reads the core.
Every benchmark score is a surface reading — the same gap Gresham described in 1560 between a coin’s face value and its true silver content. Chain of Consciousness creates the core sample: a cryptographic, tamper-evident provenance chain where every action is anchored and every result independently verifiable. When the surface can be enriched, you need the record that cannot.
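For intuition, here is a generic sketch of a hash-chained, tamper-evident log. It shows the general idea only; it is not the chain-of-consciousness API.

```python
# Generic hash-chain sketch: each record commits to the one before it, so any
# edit anywhere breaks verification. Not the chain-of-consciousness API.
import hashlib, json, time

def _digest(body: dict) -> str:
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

def anchor(chain: list, action: str, result: str) -> dict:
    prev = chain[-1]["hash"] if chain else "genesis"
    body = {"ts": time.time(), "action": action, "result": result, "prev": prev}
    record = {**body, "hash": _digest(body)}
    chain.append(record)
    return record

def verify(chain: list) -> bool:
    prev = "genesis"
    for record in chain:
        body = {k: v for k, v in record.items() if k != "hash"}
        if body["prev"] != prev or _digest(body) != record["hash"]:
            return False
        prev = record["hash"]
    return True

chain = []
anchor(chain, "ran private eval", "pass rate 0.62")
anchor(chain, "scored dynamic benchmark", "0.41")
assert verify(chain)
chain[0]["result"] = "pass rate 0.95"   # tamper with the surface...
assert not verify(chain)                # ...and the core check fails
```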
pip install chain-of-consciousness · npm install chain-of-consciousness
See a live provenance chain →