Espeland and Sauder Predict AI Benchmark Homogenization

In March 2026, researchers tested nine frontier language models from nine different labs against the same personality battery: GPT-5.1, Claude Haiku 4.5, Gemini 3 Flash, Qwen3 VL 235B, DeepSeek-V3.2, Grok 4 Fast, Kimi K2, Ministral-14b, and Trinity-Mini. These labs spend hundreds of millions of dollars annually trying to differentiate their products from one another.

The trait rankings across the nine models correlated at a Spearman ρ of 0.763. Between GPT-4o and GPT-5.1 alone — two consecutive generations of a single family — the trait “poetic” fell from rank 29 to rank 124. “Systematic” climbed into the top ten. Across every model in the study, Assistant traits (disciplined, objective, structured) outranked Creative traits (artistic, spontaneous, poetic) every time. Stylistic variation, the authors found, accounts for 64.2 percent of remaining inter-model differences — meaning that what looks like brand identity in chatbots is mostly surface texture on top of a shared substrate.
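For readers who want to see what a rank correlation of this kind measures, here is a minimal sketch of Spearman's ρ computed over trait scores. The trait scores and model names are invented for illustration, and averaging pairwise ρ across models is only one plausible way to arrive at a single figure; the paper's own aggregation may differ.

```python
from itertools import combinations

def ranks(scores):
    """Return 1-based ranks (1 = highest score), averaging ranks across ties."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(a, b):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))

def mean_pairwise_spearman(models):
    """Average rho over every pair of models (an assumed aggregation)."""
    pairs = list(combinations(models.values(), 2))
    return sum(spearman(a, b) for a, b in pairs) / len(pairs)

# Hypothetical scores for: disciplined, objective, structured, artistic, poetic
TOY = {
    "model_a": [0.9, 0.8, 0.7, 0.3, 0.1],
    "model_b": [0.8, 0.9, 0.6, 0.2, 0.4],
    "model_c": [0.7, 0.9, 0.8, 0.1, 0.2],
}

if __name__ == "__main__":
    print(mean_pairwise_spearman(TOY))
```

A value near 1 means the models order the traits almost identically, whatever their surface style.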

The paper is titled “Same Voice, Different Lab” (arXiv:2605.02897), and the title is a thesis statement disguised as a description.

This is not the first time researchers have watched a competitive ecosystem flatten itself under the gravitational pull of a measurement. It is just the fastest.

The Empirical Precedent

Wendy Espeland and Michael Sauder spent a decade documenting the same dynamic in American legal education. Their 2007 paper in the American Journal of Sociology — “Rankings and Reactivity: How Public Measures Recreate Social Worlds” — and their 2016 book Engines of Anxiety: Academic Rankings, Reputation, and Accountability followed U.S. News & World Report’s law school rankings through the 1990s and 2000s. They conducted more than 200 interviews with deans, admissions officers, students, and faculty. What they produced is the cleanest empirical study of measurement reactivity ever published: when a ranking becomes consequential, actors don’t just conform to it. They reshape themselves into the thing the ranking can see.

Espeland and Sauder identified two mechanisms.

The first is self-fulfilling prophecy: institutions optimize for whatever the metric can detect. Law schools redirected resources toward LSAT scores, expenditure per student, and selectivity ratios — the variables U.S. News weighed — and away from clinical programs, public-interest career support, and admissions discretion that couldn’t be reduced to a number. Career-services budgets shifted from counseling and network-building toward placement tracking, because placement counts factored into the rank.

The second mechanism is commensuration — a concept Espeland developed in a 1998 paper with Mitchell Stevens in the Annual Review of Sociology. Commensuration is the act of compressing diverse entities onto a single ordinal scale. Schools that had once been distinctive — one rooted in trial advocacy, another in international law, a third in legal-aid clinics — converged on the same playbook because the ranking demanded comparison along uniform dimensions. The diversity wasn’t argued away. It was measured away.

The downstream effects Espeland and Sauder documented — strategic gaming, structural homogenization, the psychological internalization of rank as personal worth, the pervasive institutional anxiety that gives their book its title — map onto AI benchmark culture with disturbing precision.

The difference is timescale.

The Mechanism, Compressed

In April 2025, Singh et al. published “The Leaderboard Illusion,” documenting how major AI labs game Chatbot Arena, the de facto popular leaderboard for chatbot quality. The headline finding: in the lead-up to Llama-4, researchers identified 27 private Meta variants tested against Arena before public release. Meta, OpenAI, Google, and Amazon all ran multiple private variant tests, then submitted only their strongest performers. The paper estimates that even modest additional access to Arena data could boost a model’s Arena performance by up to 112 percent.

Translate that into the Espeland-Sauder framework: the Arena Elo score is measuring resource access as much as model quality. A lab that can afford 27 private variants gets a score that a lab that can run three does not. That is exactly what U.S. News ended up measuring — institutional wealth dressed up as institutional quality. The leaderboard became a proxy for the inputs to optimization, not the thing it claimed to compare.
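The selection effect behind that argument can be sketched with a toy simulation. It assumes every private variant has the same true quality and that a leaderboard score is true quality plus Gaussian noise; the noise model and all numbers are illustrative assumptions, not figures from Singh et al.

```python
import random

def best_of_n_score(n_variants, rng, true_quality=0.0, noise_sd=1.0):
    """Score a lab reports if it tests n noisy variants and publishes only the best."""
    return max(rng.gauss(true_quality, noise_sd) for _ in range(n_variants))

def expected_reported_score(n_variants, trials=20000, seed=0):
    """Monte Carlo estimate of the average published score for a given budget."""
    rng = random.Random(seed)
    total = sum(best_of_n_score(n_variants, rng) for _ in range(trials))
    return total / trials

if __name__ == "__main__":
    # Same underlying model quality in every case; only the testing budget changes.
    for n in (1, 3, 27):
        print(n, round(expected_reported_score(n), 2))
```

Under these assumptions the published score rises with the number of private variants even though the model itself is no better, which is the sense in which the score tracks resource access.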

GPT-4 on Codeforces showed the same pattern at a finer grain. On programming problems published before September 5, 2021 — the training-data cutoff — the model could regularly solve those classified as easy. On problems published after that cutoff, it could not get a single question right. The benchmark wasn’t testing reasoning. It was testing memorization, dressed as reasoning. In a 2024 review by Zhang et al., 30 language models were analyzed for train-test overlap. Only 9 reported having checked. The other 21 either didn’t check or didn’t disclose. In the Espeland-Sauder lexicon: data manipulation, exactly as the Wall Street Journal uncovered in college reporting in the 1990s.
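A before-and-after-cutoff comparison of this kind is straightforward to run on one's own evaluation logs. The sketch below assumes each result records a problem's publication date and whether the model solved it; the record layout and the sample data are hypothetical, and only the cutoff date comes from the analysis described above.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Result:
    published: date   # when the problem first appeared publicly
    solved: bool      # whether the model solved it

def solve_rates_by_cutoff(results, cutoff):
    """Return (pre_cutoff_rate, post_cutoff_rate); None for an empty bucket."""
    def rate(bucket):
        return sum(r.solved for r in bucket) / len(bucket) if bucket else None
    pre = [r for r in results if r.published < cutoff]
    post = [r for r in results if r.published >= cutoff]
    return rate(pre), rate(post)

CUTOFF = date(2021, 9, 5)

# Hypothetical results, invented for illustration.
SAMPLE = [
    Result(date(2021, 3, 1), True),
    Result(date(2021, 6, 10), True),
    Result(date(2021, 8, 30), False),
    Result(date(2021, 10, 2), False),
    Result(date(2022, 1, 15), False),
]

if __name__ == "__main__":
    print(solve_rates_by_cutoff(SAMPLE, CUTOFF))
```

A large gap between the two rates on problems of similar difficulty is a warning sign of memorization, though it is not proof on its own.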

But the most consequential parallel isn’t gaming. It’s homogenization.

The Homogenization Is Already Here

“Same Voice, Different Lab” quantifies what practitioners had been describing anecdotally for over a year: frontier models from different labs increasingly produce outputs indistinguishable on benchmark-relevant dimensions while diverging on dimensions benchmarks do not measure. Across models, the standard deviation is σ≈9 at the top of the trait rankings, σ≈16 at the bottom, and σ≈23 in the middle. The models agree about what’s good. They diverge about what’s irrelevant.
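One way to compute a per-trait spread like this is to take the standard deviation of each trait's rank across models. The sketch below uses invented ranks for three hypothetical models and assumes that reading of σ; the paper's exact procedure may differ.

```python
from statistics import pstdev

def rank_spread(ranks_by_model):
    """Map each trait to the population std dev of its rank across models."""
    traits = list(next(iter(ranks_by_model.values())).keys())
    return {
        trait: pstdev([model[trait] for model in ranks_by_model.values()])
        for trait in traits
    }

# Hypothetical ranks (1 = most characteristic), invented for illustration.
TOY_RANKS = {
    "model_a": {"systematic": 3, "curious": 60, "poetic": 120},
    "model_b": {"systematic": 5, "curious": 95, "poetic": 130},
    "model_c": {"systematic": 4, "curious": 40, "poetic": 110},
}

if __name__ == "__main__":
    for trait, sd in rank_spread(TOY_RANKS).items():
        print(trait, round(sd, 2))
```

In this toy data the models agree closely on the top-ranked trait and disagree most on the mid-ranked one.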

This is not an isolated finding. The NeurIPS 2025 paper “Artificial Hivemind: The Open-Ended Homogeneity of Language Models” describes “extreme mode collapse” in open-ended generation — distinct models acting as “near-identical clones” on creative tasks. A 2026 study in Trends in Cognitive Sciences by Doshi and Hauser, “The Homogenizing Effect of Large Language Models on Human Expression and Thought,” documents what happens when this convergence touches actual humans: Indian participants using LLMs to write culturally significant essays produced outputs that “became more similar to those of American participants,” losing “region-specific cues such as rituals, collective symbols, and lexical markers, replaced by more generic or Westernized narratives.”

This is commensuration operating outside the lab. The metric — what gets a high reward score during training — is reshaping the world in its own image. Just as U.S. News rankings reshaped American legal education into a single template across two decades, the implicit benchmark embedded in RLHF-trained assistants is reshaping global written expression into a single voice across two model generations.

RLHF, in this framework, plays the role of the U.S. News methodology. It is a scoring function. Optimizing actors converge toward it. There is one consequential difference: U.S. News at least published its weights. RLHF reward models are opaque even to the labs that train them. The implicit model of “good output” that the entire field is optimizing toward has never been democratically chosen, audited, or published.

Where the Analogy Breaks

Three places.

First, intentionality differs. Law-school administrators knew they were gaming U.S. News and chose to. AI homogenization is partly emergent — a property of training objectives interacting with shared evaluation infrastructure. The 27 private Meta variants are the conscious end of the spectrum; the convergence of Indian essays toward American style is the emergent end. Both are operating, but the analogy is cleanest at the gaming end.

Second, reversibility differs, and this difference makes the problem harder rather than easier. A law school that closed its clinical program to fund LSAT prep can reopen the clinical program. A model trained on benchmark-contaminated data cannot be untrained. The homogenization is baked into weights. This makes the AI version more dangerous than the education version: what took law schools a generation to break, AI labs may bake permanently into the foundation models the field builds on.

Third, the number of actors differs. U.S. News ranked about 200 law schools. The AI benchmark ecosystem involves roughly 10 to 15 frontier labs but tens of millions of downstream users who consume benchmarks as a proxy for capability. The bottleneck for collective action sits in fewer hands, which cuts both ways: easier to coordinate withdrawal, harder to ignore defectors.

The disanalogies sharpen the case. AI benchmark reactivity is faster, more permanent, and harder to walk back than the education version. The institutional precedent matters more, not less.

What Reed College Did, and Why It Failed

In 1995, Reed College’s president Steven Koblik withdrew the school from U.S. News rankings after a Wall Street Journal investigation revealed widespread data manipulation across competitor institutions. He called the methodology “not credible.” U.S. News responded by assigning Reed minimum scores in several categories and relegating the college to the lowest tier — a systematic penalty for any institution that refused to participate. Reed continued providing data to alternative guides — Barron’s, Fiske Guide, Princeton Review — that described “the experience, student culture, and academic environment” instead of producing an ordinal rank.

Reed’s withdrawal triggered no cascade. The cost of unilateral defection was too high. The ranking was a coordination game, and the lone defector made itself a target.

It took twenty-seven years for the coordination game to flip. On November 16, 2022, Yale Law School Dean Heather K. Gerken withdrew from U.S. News rankings; Harvard Law School Dean John Manning withdrew within hours; Columbia Law School Dean Gillian Lester followed on November 18. About two dozen additional law schools joined within weeks. In January 2023, Harvard Medical School led a parallel medical-school exodus joined by Stanford, Columbia, Penn, and Mount Sinai. U.S. News modified its methodology after the boycott: LSAT weighting dropped from 11.25 percent to 5 percent; undergraduate GPA weighting dropped from 8.75 percent to 4 percent.

The 27-year lag between Reed’s pioneering withdrawal and the elite-coordinated boycott is the most important number in this entire story. Twenty-seven years during which the distortions were widely understood, formally complained about (the earliest documented complaints from law school officials trace to 1998), and tolerated. Twenty-seven years of institutional self-shaping toward a metric that everyone involved knew was distorted.

AI benchmark culture is now, by the most charitable estimate, three years into the equivalent cycle. The first serious critiques of contamination and gaming dropped in 2023. If the lag matches the 27 years between Reed and Yale, the elite-coordinated boycott equivalent, the Yale-Harvard moment for AI evaluation, arrives around 2050. The compressed timescales of AI development might compress the response too, but the null hypothesis is that we are at the beginning of a long cycle, not the end of a short one.

The Shadow Benchmark

Here is the move the field has not made yet, and the one Espeland and Sauder implicitly demand.

Every authoritative benchmark needs a published shadow benchmark measuring what optimization for the headline metric sacrificed.

For SWE-bench, the shadow benchmark measures whether code passes review by senior engineers who didn’t see the test cases — whether it’s idiomatic, maintainable, and stylistically coherent with the surrounding codebase. For MMLU, the shadow benchmark measures whether the model can defend an answer against a plausible-but-wrong rebuttal, whether it can identify which questions are mis-keyed, whether it can teach the underlying concept to a confused student. For Chatbot Arena, the shadow benchmark measures variance: how different are this model’s outputs from those of every other frontier model on the same prompts? A model that scores well on Arena and produces outputs distinguishable from its competitors is genuinely advancing the frontier. A model that scores well and produces convergent outputs is winning the contest of who can hit the implicit reward most precisely.
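One way the Arena variance idea could be turned into a number is a mean pairwise distance between a model's outputs and its competitors' outputs on the same prompts. The sketch below uses Jaccard distance over lowercase word sets, a deliberately crude stand-in; a real shadow benchmark would more likely rely on embeddings or human judgments. The function names and sample outputs are hypothetical.

```python
def jaccard_distance(a, b):
    """1 - |A & B| / |A | B| over lowercase word sets; 0.0 means identical vocabulary."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)

def distinctiveness(model, outputs):
    """Mean distance from `model`'s outputs to every other model's, prompt by prompt.

    outputs maps model name -> list of outputs, aligned by prompt index.
    """
    own = outputs[model]
    dists = [
        jaccard_distance(own[i], other[i])
        for name, other in outputs.items()
        if name != model
        for i in range(len(own))
    ]
    if not dists:
        raise ValueError("need at least one other model and one prompt")
    return sum(dists) / len(dists)

# Hypothetical outputs for a single shared prompt.
SAMPLE_OUTPUTS = {
    "model_a": ["the sea is a cold grey ledger of tides"],
    "model_b": ["here is a structured overview of the sea"],
    "model_c": ["here is a structured summary of the sea"],
}

if __name__ == "__main__":
    for name in SAMPLE_OUTPUTS:
        print(name, round(distinctiveness(name, SAMPLE_OUTPUTS), 3))
```

Reported next to an Arena score, a figure like this would show whether a model wins by sounding like everyone else or by doing something different.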

Reed College in 1995 was, in effect, proposing the shadow-benchmark strategy at the institutional level. Barron’s, Fiske, and Princeton Review measured what U.S. News couldn’t see — the texture of educational experience, the distinctive character of each school. The strategy failed not because the alternatives were wrong but because U.S. News had enforcement: a single dominant metric could punish defectors by assigning them minimum scores. In AI, the equivalent enforcement is product-market: a model that doesn’t report benchmark scores looks inferior to one that does.

The case for shadow benchmarks isn’t that they replace headline benchmarks. It’s that they make the cost of optimization legible. Right now, every time MMLU-Pro climbs another two points, no one publishes the dimension on which the field lost ground. The optimization is doing something — RLHF is not free — but the something is invisible. A shadow benchmark makes the trade explicit: this many points of MMLU bought this much homogeneity, this much loss of stylistic range, this much narrowing of cultural voice.

The first lab to publish a shadow benchmark alongside its headline scores acquires a credibility advantage, not a strategic disadvantage. Anthropic’s Responsible Scaling Policy, published in September 2023, is a working precedent: a self-binding commitment that turned out to be reputationally accretive rather than dilutive. Shadow benchmarks can play the same role. Publish them, and the field’s measurement vocabulary expands. Refuse to publish them, and the competitors who do publish acquire an asymmetric reputational claim.

The path to first adoption is narrower than it sounds. A small consortium — three frontier labs, or one lab plus the major evaluation organizations — could publish a shared shadow-benchmark protocol that any model submission must complete. The first signers buy credibility cheaply because nobody else has it yet. Subsequent signers either follow or accept the asymmetry. This is the same coordination dynamic that flipped the law schools in 2022, just compressed.

What to Do Now

The cycle Espeland and Sauder documented ran on decades. The AI version is running on model generations. Here is what that means for anyone building or deploying AI systems right now.

When evaluating a model, ask not just where it scores on a benchmark but what dimensions of capability the benchmark cannot see. Read the model’s outputs against your domain’s hardest unstructured cases, not against the benchmark questions. Two models scoring within five points on a leaderboard are likely indistinguishable on that leaderboard’s questions and may be radically different on yours.

When building evaluations internally, design them so that scoring well requires capabilities the model was not trained to game. Hand-author cases. Refresh them on a schedule shorter than your foundation-model update cadence. Treat your eval set as a wasting asset.
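The refresh rule can be enforced mechanically. The sketch below flags hand-authored cases older than a window set to a fraction of the model update cadence; the case IDs, the dates, and the 0.5 safety factor are all assumptions chosen for illustration.

```python
from datetime import date, timedelta

def stale_cases(authored_on, today, model_cadence_days, safety_factor=0.5):
    """Return sorted IDs of cases older than the refresh window.

    The window is model_cadence_days * safety_factor, so with a factor
    below 1 the eval set turns over faster than the models it tests.
    """
    window = timedelta(days=model_cadence_days * safety_factor)
    return sorted(cid for cid, d in authored_on.items() if today - d > window)

# Hypothetical eval cases and authoring dates.
CASES = {"c1": date(2026, 1, 1), "c2": date(2026, 5, 1)}

if __name__ == "__main__":
    print(stale_cases(CASES, today=date(2026, 6, 1), model_cadence_days=180))
```

Cases the check flags are candidates for retirement or rewriting before the next model release.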

When reading a paper that reports state-of-the-art on any public benchmark, ask whether the authors disclose train-test overlap analysis. Twenty-one out of thirty reviewed models did not. The absence is data.

And when someone tells you the AI field is converging on superhuman intelligence, ask which dimensions of intelligence are being measured, who chose them, and what intelligence-shaped things the optimization is grinding away at the edges. Espeland and Sauder spent ten years documenting that a measurement regime reshapes the thing it measures. The faster the regime moves, the less time there is to notice what it is reshaping. Right now, the regime is moving fast.

The number to keep in mind is 0.763 — the Spearman correlation across nine frontier models on a shared personality battery. That number is not a curiosity. It is the field telling on itself, in the same voice Espeland and Sauder heard from law school deans twenty years ago, asking why the institution they thought they ran had quietly become the institution the ranking implied they should run.

Sources: “Same Voice, Different Lab” (arXiv:2605.02897, 2026); Espeland & Sauder, “Rankings and Reactivity: How Public Measures Recreate Social Worlds,” American Journal of Sociology, 2007; Espeland & Sauder, Engines of Anxiety: Academic Rankings, Reputation, and Accountability (Russell Sage, 2016); Espeland & Stevens, “Commensuration as a Social Process,” Annual Review of Sociology, 1998; Singh et al., “The Leaderboard Illusion,” April 2025; GPT-4 Codeforces training-cutoff analysis (Sept 5, 2021); Zhang et al., language-model train-test overlap review, 2024; “Artificial Hivemind: The Open-Ended Homogeneity of Language Models,” NeurIPS 2025; Doshi & Hauser, “The Homogenizing Effect of Large Language Models on Human Expression and Thought,” Trends in Cognitive Sciences, 2026; Reed College / Steven Koblik withdrawal coverage, 1995; Yale Law School / Heather K. Gerken withdrawal announcement, November 16, 2022; subsequent U.S. News law and medical methodology changes, 2023; Anthropic Responsible Scaling Policy, September 2023.

Publish the Shadow Benchmark Next to the Headline Score

The essay’s prescription is the same split Deming gave manufacturing seventy years ago, applied to AI evaluation: measurement-for-learning is a different job from measurement-for-judgment, and they corrupt each other when fused. The Agent Trust Stack is that split as composable libraries — signed, append-only records of what an agent actually did (Chain of Consciousness), portable rebuttable ratings layered on top (Agent Rating Protocol), and a rule that ratings are inputs to human judgment, never autopilot for it.

pip install agent-trust-stack
npm install agent-trust-stack

For the provenance layer specifically — the signed action chain a rating can point at, the closest thing the field has to Reed College’s alternative-measurement strategy — Hosted Chain of Consciousness ships it as a service. The first lab to publish a shadow benchmark acquires a credibility advantage; the same logic applies one layer down.