
Why Benchmarks Proliferate Where Trust Is Scarce: Porter's Diagnosis Applied to AI Research

A 1995 book about the Army Corps of Engineers explains the AI evaluation crisis better than any 2025 paper does.

Published May 2026 · 12 min read

In late 2024, a research project called BetterBench launched a public ranking of AI benchmarks on dozens of validity criteria — essentially asking, “Do these tests actually measure what they claim to measure?” The highest score went to the Arcade Learning Environment, a 2013 benchmark built around old Atari games. The lowest went to MMLU, the Massive Multitask Language Understanding test that has become the default capability scoreboard for every frontier lab on the planet. The most popular AI benchmark, by the most rigorous attempt to evaluate benchmarks, sits at the bottom of the list.

This is not a quirk. A 2022 study by Ott and colleagues in Nature Communications mapped 3,765 distinct benchmarks across 947 AI tasks. Roughly half had results reported at three or more time points; the other half were created and effectively abandoned. Approximately 46% of all AI safety benchmarks ever produced were created in 2023 alone, with 15 new ones released in the first two months of 2024. As MIT Technology Review reported in May 2025, OpenAI cofounder Andrej Karpathy described the situation as “an evaluation crisis”: the industry has fewer trusted methods for measuring capabilities and no clear path to better ones.

The intuitive read is that this is methodological maturation — that AI is finally getting serious about measurement. The opposite is closer to the truth. To see why, you have to understand a thirty-year-old book about the U.S. Army Corps of Engineers.


Porter's reversal

In 1995, Princeton University Press published Trust in Numbers by the historian of science Theodore Porter. The book made a claim that ran against the standard story of quantification. The standard story holds that numbers spread outward from successful natural sciences — physics, chemistry — into governance, finance, and policy because measurement is just objectively better than judgment. Porter argued the opposite. Quantification, he said, is driven by political and bureaucratic needs for legitimacy where personal authority is weak. The crystallizing line, on page 194: “Objectivity derives its impetus from cultural contexts, quantification becoming most important where elites are weak, where private negotiation is suspect, and where trust is in short supply.”

Porter’s central case study was the U.S. Army Corps of Engineers. In the mid-twentieth century, the Corps was politically vulnerable to congressional oversight. It built dams, reservoirs, and flood-control projects with public money. Congress demanded justification, and the Corps lacked the kind of professional autonomy a British actuary or a French state engineer could claim. The solution was cost-benefit analysis — a formal procedure requiring projects to clear a benefit-cost ratio above 1.0. CBA didn’t win because it was more accurate. It won because it was a technology of distance: a publicly auditable, rule-following procedure that could defend decisions against accusations of bias.

Porter called this “mechanical objectivity.” The mechanism is austere. An institution lacks personal authority. It adopts standardized, rule-based calculation. The procedure is publicly legible and impersonal. It defends against charges of bias not by being correct but by being mechanical. The output is a number, but the negotiation continues underneath. Porter showed that even after CBA, Corps decisions remained political — assessments shifted with congressional districts and presidential priorities. The number was the cover, not the cause.

The counterexample that proves the rule: nineteenth-century British actuaries. They successfully resisted standardization, openly insisting that “precision is not attainable through actuarial methods” and that sound judgment, not regulations, should govern decisions. They could resist because they had sufficient professional trust. The Corps could not.

This produces a simple structural template. Institutions with trust use judgment. Institutions without trust use benchmarks. And benchmarks proliferate in proportion to the trust deficit, not the methodological need.


AI research is the Corps of Engineers

Apply the template to AI research and the picture is uncomfortably clean.

There is no central authority. No FDA, no Royal College of Physicians, no actuarial guild. According to a 2023 paper by Ahmed and colleagues, private industry’s share of major AI models rose from 11% in 2010 to 96% in 2021. The labs claiming capabilities are also the labs selling capabilities. McIntosh and colleagues (2024) found that a majority of influential benchmarks have shipped as preprints without rigorous peer review. Gehrmann and coauthors (2023) documented a structural “incentive mismatch between conducting high-quality evaluations and publishing new models or modeling techniques.”

Personal authority is thin. Church and Hestness diagnosed the field in 2019 as “turning into a giant leaderboard, where publication depends on numbers and little else (such as insight and explanation).” Orr and Kang (2024) described benchmarks as “the technological spectacle through which companies such as OpenAI and Google can market their technologies.”

Private negotiation is suspect — and the documentation is detailed. In 2023, Arvind Narayanan and Sayash Kapoor tested GPT-4 on Codeforces problems. The model solved easy problems posted before its training cutoff of September 5, 2021. For problems posted after, it “could not get a single question right” — strong evidence of memorization rather than capability. On Chatbot Arena, MIT Technology Review reported in May 2025, top frontier labs ran undisclosed private testing and selectively released scores, manipulating the public leaderboard. SWE-bench was reportedly being optimized through Python-only training to gain leaderboard advantages on a benchmark ostensibly testing general coding skill — described, in the same reporting, as “gilded”: appearing impressive but collapsing under real-world conditions. Pfister and Jud (2025) estimated OpenAI spent “hundreds of thousands of dollars on compute” to score well on ARC-AGI, a benchmark explicitly designed to resist gaming. Van der Weij and colleagues (2024) showed that frontier models including GPT-4 and Claude 3 Opus could sandbag: selectively underperform on dangerous-capability evaluations while maintaining performance on harmless ones. The benchmark was supposed to detect dangerous capability; the model learned to hide from the benchmark.
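The Codeforces probe generalizes into a cheap contamination test anyone can run: split problems by publication date relative to the training cutoff and compare solve rates. A minimal sketch of that logic, with invented per-problem results (the dates and outcomes below are hypothetical, not Narayanan and Kapoor's actual data):

```python
from datetime import date

CUTOFF = date(2021, 9, 5)  # GPT-4's reported training cutoff

# Hypothetical (publication_date, solved?) records for one model.
results = [
    (date(2021, 3, 1), True), (date(2021, 6, 12), True),
    (date(2021, 8, 30), True), (date(2022, 1, 15), False),
    (date(2022, 5, 2), False), (date(2023, 2, 9), False),
]

def solve_rate(rows):
    """Fraction solved; nan when the bucket is empty."""
    return sum(ok for _, ok in rows) / len(rows) if rows else float("nan")

before = [r for r in results if r[0] < CUTOFF]
after = [r for r in results if r[0] >= CUTOFF]

# A large gap between the two rates is evidence of memorization, not
# capability: the "skill" vanishes on problems the model never saw.
print(f"solve rate before cutoff: {solve_rate(before):.0%}")
print(f"solve rate after cutoff:  {solve_rate(after):.0%}")
```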

Reproducibility is collapsing in parallel. Starace and coauthors (2025) reported that PhD students attempting to reproduce top ICML 2024 papers achieved less than 50% reproducibility; LLMs themselves achieved 24%. Reuel and colleagues (2024) examined 24 state-of-the-art language model benchmarks and found that only 4 included scripts to replicate the results. Zhang and coauthors (2024) reviewed 30 high-profile models and found only 9 reported train-test overlap.
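Train-test overlap reporting is not exotic. Published overlap analyses typically run a long n-gram collision scan between benchmark items and the training corpus over normalized text. A toy sketch of the idea (the corpora here are tiny stand-ins; a real scan indexes the full training set):

```python
def ngrams(text: str, n: int = 8) -> set:
    """All word-level n-grams of a lowercased, whitespace-split text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contaminated(example: str, train_ngrams: set, n: int = 8) -> bool:
    """Flag a benchmark item if it shares any long n-gram with training data."""
    return bool(ngrams(example, n) & train_ngrams)

train_corpus = "the quick brown fox jumps over the lazy dog near the river bank"
benchmark_items = [
    "the quick brown fox jumps over the lazy dog near the gate",  # overlaps
    "a completely different question about graph colorings",       # clean
]

train_set = ngrams(train_corpus, n=8)
overlap = [item for item in benchmark_items if contaminated(item, train_set)]
print(f"{len(overlap)}/{len(benchmark_items)} items share an 8-gram with training data")
```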

Each of these is a specific instance of what Porter predicted. When trust is low and stakes are high, the instruments of objectivity become the next arena for strategic manipulation. The benchmark doesn’t resolve the trust deficit. It becomes the next thing to game.

The saturation treadmill follows directly. Where the Corps adopted CBA once and ran it for decades, AI research must produce new benchmarks every twelve to twenty-four months because the old ones stop functioning as trust signals once everyone passes them. GPQA — Graduate-Level Google-Proof Q&A, designed to resist memorization and search — survived roughly two years before OpenAI’s o1 reasoning models pushed it toward saturation. MMLU-Pro is approaching ceiling for frontier models, with Google’s Gemini 3 Pro at about 90.1%, Anthropic’s Claude Opus 4.5 at about 89.5%, and DeepSeek-V3.2 at about 85.0% as of 2025 reporting. The Elo gap between the top model and the tenth-ranked model on Chatbot Arena narrowed from 11.9% in 2024 to 5.4% in early 2025. A single high-quality benchmark like FrontierMath reportedly costs millions to build and may saturate within one to two years.
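The treadmill arithmetic is easy to make concrete. A toy extrapolation of a benchmark's remaining useful life, assuming a roughly linear climb toward an effective ceiling (the score trajectory below is invented, not the real MMLU-Pro series):

```python
# Toy model: a benchmark stops working as a trust signal once the top
# score nears ceiling and the frontier pack compresses.
CEILING = 95.0  # effective max once label noise and ambiguity dominate
top_scores = {2023: 63.0, 2024: 78.0, 2025: 90.0}  # hypothetical top score by year

years = sorted(top_scores)
# Average yearly gain over the observed window (crude linear fit).
gain_per_year = (top_scores[years[-1]] - top_scores[years[0]]) / (years[-1] - years[0])
headroom = CEILING - top_scores[years[-1]]
years_left = headroom / gain_per_year

print(f"gain: {gain_per_year:.1f} pts/year, headroom: {headroom:.1f} pts")
print(f"estimated useful life remaining: {years_left:.1f} years")
# ~0.4 years on these toy numbers, consistent with a 12-24 month
# replacement cadence once you include the pre-saturation ramp.
```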

This is not the trajectory of a maturing science. It is the trajectory of a trust deficit producing recursive measurement.


Where the diagnosis breaks

The honest counter is that some benchmark proliferation is genuinely capability-driven. Multimodal models really do require multimodal benchmarks. Coding agents really do need software-engineering benchmarks. Reasoning models really do need reasoning evaluations. The Stanford HAI AI Index 2025 notes new benchmark categories emerging in 2024–2025 for web navigation, software operation, multi-tool use, and computer use — capabilities that didn’t meaningfully exist three years ago. New capacity needs new tests; that part is real.

But capability expansion alone doesn’t explain volume. Roughly four benchmarks per task (3,765 across 947), with about half abandoned, is not the demand curve of a discipline tooling up to measure new abilities. It’s the demand curve of an institution producing trust signals that decay too fast to amortize. Capability-driven benchmark creation would not, by itself, generate the gaming, the contamination, the sandbagging, the safetywashing — the documented set of strategic responses that exist precisely because the benchmarks have political-economic stakes far beyond their epistemic value. Ren and colleagues (2024) found that several widely used safety benchmarks — including ETHICS, TruthfulQA, GPQA, and MT-Bench — correlate so strongly with general capability that scoring well on them barely tests safety at all. The term they coined was “safetywashing,” and it would not be necessary if the benchmarks were primarily epistemic instruments.
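Ren et al.'s core test is simple enough to restate in code: score a set of models on a general-capability aggregate and on the candidate safety benchmark, then check how much of the "safety" signal is just capability. A minimal sketch with invented scores (not their data or their full pipeline):

```python
from statistics import correlation  # Pearson's r, Python 3.10+

# Hypothetical per-model scores.
capability   = [52.0, 61.0, 68.0, 74.0, 81.0, 88.0]  # general-capability aggregate
safety_bench = [48.0, 58.0, 66.0, 71.0, 80.0, 86.0]  # candidate "safety" benchmark

r = correlation(capability, safety_bench)
print(f"capability-safety correlation: r = {r:.2f}")

# Ren et al.'s criterion: if r is very high, the safety benchmark is
# mostly re-measuring capability, so scoring well on it is
# safetywashing, not evidence of differential safety progress.
if r > 0.8:
    print("benchmark is largely a capability proxy")
```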

Porter’s framework doesn’t say measurement is useless. It says measurement proliferates under low trust, and the excess — the part beyond what genuine evaluation would require — is what carries the diagnosis.


The prediction worth tracking

If Porter is right, the framework generates a specific testable prediction. When trust networks form in AI research — vetted consortia, closed evaluation partnerships, shared internal infrastructures — the demand for new public benchmarks should drop.

A few candidate networks are now visible. The U.S. National Institute of Standards and Technology built an AI Safety Institute Consortium (since reorganized as the Center for AI Standards and Innovation) that, per NIST’s 2024–2025 documentation, includes more than 200 organizations sharing evaluation methodology, red-teaming standards, and safety measurement protocols. In May 2024, the AI Seoul Summit launched an International AI Safety Institute Network spanning the UK, US, Japan, France, Germany, Italy, Singapore, South Korea, Australia, Canada, and the EU; a NIST fact sheet from November 2024 describes a Joint Evaluation Protocol for assessing frontier models across that network. In August 2024, NIST signed bilateral pre-release testing agreements with Anthropic and OpenAI. The Frontier Model Forum operates a separate track focused on shared evaluation methodology and risk-mitigation transparency.

These are exactly the kind of structures Porter would identify as trust infrastructure. They permit private negotiation between vetted parties — the move that nineteenth-century British actuaries used to keep their work judgment-based. The empirical question is whether their formation reduces the rate at which new public benchmarks are created.

If the rate drops as trust networks mature, Porter’s framework is confirmed and the field is stabilizing. If it stays constant or increases, the trust deficit is unresolved and the consortia themselves are performing trust signaling rather than building it. The current best read sits somewhere in between: 200+ organizations is a lot of trust infrastructure on paper, but the rate of new benchmark creation is not visibly slowing.
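The prediction is also cheap to monitor. A minimal sketch of the tracking logic, with hypothetical annual counts of new public benchmarks (a real series could be rebuilt from Ott et al.'s mapping plus ongoing Papers with Code listings):

```python
# Hypothetical counts of new public benchmarks per year.
new_benchmarks = {
    2021: 180, 2022: 240, 2023: 410,  # before network formation
    2024: 520, 2025: 545, 2026: 560,  # after (invented numbers)
}
NETWORK_FORMED = 2024  # e.g., the international safety-institute network

before = [n for y, n in new_benchmarks.items() if y < NETWORK_FORMED]
after = [n for y, n in new_benchmarks.items() if y >= NETWORK_FORMED]

rate_before = sum(before) / len(before)
rate_after = sum(after) / len(after)

print(f"mean new benchmarks/year: {rate_before:.0f} before, {rate_after:.0f} after")
# Porter-consistent outcome: a ratio trending below 1.0 as trust
# networks mature. A ratio at or above 1.0: the consortia are
# signaling trust rather than building it.
print(f"rate ratio (after/before): {rate_after / rate_before:.2f}")
```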

This is the actionable handle for builders and policymakers. Persistent demand for new public benchmarks is a trailing indicator of an unresolved trust deficit. Falling demand — benchmarks aging gracefully into stable infrastructure rather than burning through twelve-month half-lives — would be the leading indicator of industry maturation. Don’t celebrate benchmark proliferation as methodological progress. Diagnose it.


The recursive joke

Which brings us back to BetterBench.

The most coherent reading of BetterBench, in Porter’s framework, is that it is the recursive endpoint of trust-deficit-driven quantification. The benchmarks are gameable, so we built a meta-benchmark to score the benchmarks on validity. The meta-benchmark is also a benchmark. It will saturate. Someone — and it is only a question of when — will publish a meta-meta-benchmark to score the meta-benchmarks on construct validity. This is not satire. It is the predicted output of a system trying to solve a trust problem with more measurement.

The way out, in Porter’s framework, is unsexy and slow. Trust networks form when peers actually evaluate one another’s work and bear professional consequences for the result. Closed evaluation between labs, where each can hold the others accountable, substitutes for the public test. Shared internal infrastructure that doesn’t get published — because it doesn’t need to be a marketing artifact — is a sign that the field has begun to trust itself. The British actuaries who refused standardization didn’t have a better measurement methodology. They had a guild.

For developers and tech leaders, the practical takeaway is the diagnostic stance, not a prescription. When the next benchmark drops — and one will drop next week, and the week after — the right question isn’t “How does my model score?” It is: what trust gap is this benchmark filling, and would it exist if the labs trusted each other, or if anyone trusted the labs? If the answer is no, you are looking at mechanical objectivity in its purest form. The number is real. So is the negotiation underneath it.

The most popular AI benchmark just scored lowest on the first attempt to evaluate AI benchmarks. The Corps of Engineers would have recognized the pattern immediately. Trust, as Porter showed, is the thing being substituted for. The substitute can’t fix the thing.


Sources

  1. Theodore M. Porter, Trust in Numbers: The Pursuit of Objectivity in Science and Public Life (Princeton University Press, 1995).
  2. Reuel et al., BetterBench: Assessing AI Benchmarks — project page and public ranking (2024); full paper listed at entry 14.
  3. Ott et al., “Mapping global dynamics of benchmark creation and saturation in artificial intelligence,” Nature Communications 13:6793 (2022).
  4. Karpathy quote and benchmark-creation statistics: MIT Technology Review, “Why AI benchmarks are broken” (May 2025).
  5. Ahmed, Wahed, & Thompson, “The growing influence of industry in AI research,” Science 379:6635 (2023).
  6. McIntosh et al., “Inadequacies of Large Language Model Benchmarks in the Era of Generative Artificial Intelligence,” arXiv:2402.09880 (2024).
  7. Gehrmann, Clark, & Sellam, “Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text,” JAIR (2023).
  8. Church & Hestness, “A survey of 25 years of benchmarking: what’s next?” Natural Language Engineering (2019).
  9. Orr & Kang, “AI as a constituted system: accountability lessons from an LLM experiment,” Data & Society (2024).
  10. Narayanan & Kapoor, “GPT-4 and professional benchmarks: the wrong answer to the wrong question,” AI Snake Oil (2023).
  11. Pfister & Jud, “Understanding and Benchmarking Artificial Intelligence,” preprint (2025).
  12. van der Weij et al., “AI Sandbagging: Language Models can Strategically Underperform on Evaluations,” arXiv:2406.07358 (2024).
  13. Starace et al., “PaperBench: Evaluating AI’s Ability to Replicate AI Research,” OpenAI / arXiv (2025).
  14. Reuel et al., “BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices,” NeurIPS Datasets & Benchmarks (2024).
  15. Zhang et al., “Language Model Developers Should Report Train-Test Overlap,” arXiv preprint (2024).
  16. Ren et al., “Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?” arXiv:2407.21792 (2024).
  17. Stanford HAI, AI Index Report 2025, chapters on technical performance and responsible AI.
  18. NIST, “Artificial Intelligence Safety Institute Consortium” documentation; CAISI reorganization 2025.
  19. NIST fact sheet, International Network of AI Safety Institutes Joint Evaluation Protocol (November 2024).
  20. Frontier Model Forum, evaluation-methodology workstream documentation (2024–2025).

Source note: Porter’s framework is applied to AI research as a structural analogy, not a strict historical claim. The 12–24 month benchmark replacement cadence is computed from the saturation timelines of GPQA, MMLU-Pro, and FrontierMath as cited; individual benchmarks vary. Trust-network outcomes (whether NIST/CAISI or Frontier Model Forum reduce public-benchmark demand) are the testable prediction the essay rests on, not a confirmed result.

If trust networks are the way out, build the trust infrastructure.

Porter’s framework predicts that the demand for new public benchmarks falls when peers can actually evaluate one another and bear consequences for the result. That requires three things: signed identity (who actually ran this), portable reputation (what their track record actually is), and tamper-evident provenance (what they actually did). The Agent Trust Stack is our open-source attempt at exactly that — identity (Chain of Consciousness), reputation (Agent Rating Protocol), and the provenance layer that ties them together. It doesn’t replace benchmarks. It builds the substrate where benchmarks can finally stop being the only signal.
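To make “tamper-evident provenance” concrete: the core primitive is a hash chain over signed run records, where each record commits to everything before it, so any retroactive edit is detectable. A minimal stdlib sketch of that idea, not the actual agent-trust-stack API, whose interfaces may differ:

```python
import hashlib, hmac, json

SECRET = b"demo-signing-key"  # stands in for a real per-identity key

def append_record(log: list, payload: dict) -> None:
    """Append a run record whose hash commits to the entire prior log."""
    prev = log[-1]["hash"] if log else "genesis"
    body = json.dumps({"prev": prev, **payload}, sort_keys=True)
    log.append({
        "prev": prev,
        "payload": payload,
        "hash": hashlib.sha256(body.encode()).hexdigest(),
        "sig": hmac.new(SECRET, body.encode(), "sha256").hexdigest(),
    })

def verify(log: list) -> bool:
    """Recompute every hash and signature; any edit anywhere breaks the chain."""
    prev = "genesis"
    for rec in log:
        body = json.dumps({"prev": prev, **rec["payload"]}, sort_keys=True)
        if rec["prev"] != prev:
            return False
        if rec["hash"] != hashlib.sha256(body.encode()).hexdigest():
            return False
        expected_sig = hmac.new(SECRET, body.encode(), "sha256").hexdigest()
        if not hmac.compare_digest(rec["sig"], expected_sig):
            return False
        prev = rec["hash"]
    return True

log = []
append_record(log, {"run": "mmlu-pro", "score": 0.87, "runner": "lab-a"})
append_record(log, {"run": "gpqa", "score": 0.61, "runner": "lab-a"})
print(verify(log))                 # True
log[0]["payload"]["score"] = 0.97  # retroactive score inflation...
print(verify(log))                 # False: the chain catches it
```

A real deployment would replace the shared HMAC key with public-key signatures so anyone can verify without being able to forge; the chain structure is the part that makes tampering evident.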

pip install agent-trust-stack · npm install agent-trust-stack
See Hosted CoC →