From Connoisseurship to Population: The Pivot Coming for Agent Evaluation

Your benchmark score is one specimen. The population is what matters.

Published May 2026

In 2022, the American Numismatic Society finished a quiet piece of infrastructure work that did not make the AI-newsletter rounds because it had nothing to do with AI. Richard Schaefer's Roman Republican Die Project — the result of fifty years of one numismatist photographing and cataloguing ancient Roman coins from collections worldwide — was digitized, organized, and made downloadable as CSV. The corpus is roughly 300,000 documented specimens, each linked to the dies that struck it. The methodological consequences are still being worked out.

Before this infrastructure, answering “how many obverse dies were used to strike denarii of Crawford 28/1” required either decades of expert specimen-by-specimen examination or an educated guess. After it, the answer is a statistical estimate computed in seconds using methods published in 2006. The shift is unglamorous and consequential. It changes what a numismatic claim means: from an expert's judgment about a unique specimen against a catalogue to a sample estimate with an explicit uncertainty interval, drawn from an estimable population.

This essay is about why agent evaluation in 2026 looks like numismatics in 1965, and what the thirty years numismatics then spent moving toward population methods suggest about the path agent evaluation is now starting down.

The connoisseurship era and what it could not answer

Traditional numismatics, for most of the twentieth century, was a connoisseurship discipline. The reference works that defined the field — Mattingly and Sydenham's Roman Imperial Coinage (1923–1994), Crawford's Roman Republican Coinage (1974), Sear's catalogues — assigned each coin type a unique identifier (RIC II Vespasian 51, RRC 28/1) and described its iconography, legends, weight standards, and metallic content. The numismatist's skill was identification: which type is this specimen, which die struck it, what is its condition.

Connoisseurship is good at what it is designed to do — identify forgeries, distinguish mint variants, date specimens within tight historical bounds, maintain the catalogue. None of this is replaceable by statistics, and the literature is clear that none of it has been replaced.

What connoisseurship cannot do structurally is answer population-level questions. How many dies were used to mint Crawford 28/1 across its production run? What fraction of the original mintage survives today? These are questions about populations of coins — most of which are not in any modern collection, and many of which were melted down for their silver before the type entered the cultural memory. The connoisseur examining one specimen cannot, by virtue of being a connoisseur, infer the size or shape of the population the specimen was drawn from.

Esty, the geometric model, and the singletons signal

Warren Esty's 2006 paper “How to estimate the original number of dies and the coverage of a sample,” and his 2011 follow-up “The Geometric Model for Estimating the Number of Dies,” did for numismatic populations what Lander and Waterman had done for sequencing-depth estimation and what ecologists had done for species richness: they ported the right statistical machinery into the right field.

The machinery itself is straightforward enough to summarize without equations. Each ancient coin was struck by a pair of dies — an obverse and a reverse. The number of distinct die pairs that existed during a type's production is the population. The number of distinct die pairs observed in surviving specimens is the sample. The bridge between observation and inference is the singletons signal: how many die pairs are observed exactly once. A high singletons ratio means many dies are barely sampled, which means many more dies probably existed that have not been sampled at all. A low singletons ratio means the common dies have been thoroughly observed and the remaining unknowns are increasingly rare.

This logic is identical to the logic of unseen-species estimators in ecology and to the Good–Turing frequency estimator in cryptanalysis. The lineage is worth pausing on. The Good–Turing estimator was developed by Alan Turing and I. J. Good at Bletchley Park during the Second World War for estimating the frequency of cipher elements that had not yet been observed in intercepted traffic; Good published it in 1953. Ecologists had been working on the same unseen-species problem since Fisher, Corbet, and Williams in 1943, and later work by Good and Toulmin (1956) and Chao (1984) produced the estimators ecology uses today. The mathematics was carried into numismatics in the 2000s by Esty and others. It is now waiting, ported and validated, for a fourth domain to pick it up. That domain, this essay argues, is agent evaluation, and the timing is not coincidental: the same conditions that triggered the numismatic pivot are now present in evaluation.
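
To make the singletons logic concrete, here is a minimal sketch of the Good–Turing estimate of unseen probability mass: the number of classes seen exactly once, divided by the total number of observations. The die labels are invented for illustration and do not come from the RRDP.

```python
from collections import Counter


def good_turing_unseen_mass(observations):
    """Estimate the chance that the next observation is a class not yet seen.

    Good-Turing estimate: (classes seen exactly once) / (total observations).
    """
    if not observations:
        raise ValueError("need at least one observation")
    counts = Counter(observations)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(observations)


# Hypothetical die identifications for ten specimens of one coin type.
specimens = ["A", "A", "A", "B", "B", "C", "D", "E", "E", "F"]

unseen = good_turing_unseen_mass(specimens)  # three singletons (C, D, F) out of ten
coverage = 1.0 - unseen                      # estimated share of output from dies already seen
print(f"unseen mass ~ {unseen:.2f}, sample coverage ~ {coverage:.2f}")
```

Note that this is coverage in the probability-mass sense: the share of future draws expected to come from classes already observed, which is a different quantity from the share of distinct classes observed.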

The Roman Republican Die Project as case study

The Roman Republican Die Project's significance is not the statistical theory — Esty's geometric model existed in published form before the project's CSV exposure. The significance is the infrastructure that lets the theory be applied at scale. Three components matter: a catalogue framework (Crawford 1974 — without it, no “population of Crawford 28/1 dies” exists to count); die-linking metadata (specimen-by-specimen die identification codified as structured data — without it, no singletons); and open, machine-readable data (the CSV exposure — any researcher with R or Python can now compute population estimates without flying to a library in Berlin).

The RRDP has been used to test Esty's estimators against ground truth in cases where the actual die count is known — Roman Republican coins sometimes carry sequential control marks that label each die uniquely. On these validating subsets, the estimators perform well. That empirical anchoring is what lets a young statistical method be trusted in cases where ground truth is not available.
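
The validation idea can be sketched with a simulation: build a population whose size is known, draw a sample, and compare the estimate with the truth. Everything here is an assumption made for illustration (the population size, the unequal die usage, the sample size, and the use of Chao1 rather than Esty's geometric model); it is not a reproduction of any published RRDP test.

```python
import random
from collections import Counter


def chao1(counts):
    """Chao1 lower-bound estimate of the number of classes, from per-class counts."""
    s_obs = len(counts)
    f1 = sum(1 for c in counts if c == 1)
    f2 = sum(1 for c in counts if c == 2)
    if f2 == 0:
        # Bias-corrected form, used here only to avoid dividing by zero.
        return s_obs + f1 * (f1 - 1) / 2
    return s_obs + f1 * f1 / (2 * f2)


rng = random.Random(0)            # fixed seed so the run is repeatable
TRUE_DIES = 200                   # the "control mark" ground truth, known by construction
weights = [1.0 + (i % 5) for i in range(TRUE_DIES)]  # mildly unequal die usage

# Draw 300 surviving specimens; each one records which die struck it.
sample = rng.choices(range(TRUE_DIES), weights=weights, k=300)
counts = list(Counter(sample).values())

observed = len(counts)
estimate = chao1(counts)
print(f"true dies: {TRUE_DIES}, observed: {observed}, Chao1 estimate: {estimate:.1f}")
```

Repeating the draw across many seeds and sample sizes shows how often, and by how much, the estimate misses the known total, which is the kind of evidence the control-mark subsets provide for real coinages.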

The next time a numismatist publishes a finding about Crawford 28/1's mint output, they will not present an expert judgment about a unique specimen. They will present a sample-based estimate with explicit uncertainty bounds, computed by methods any other numismatist can replicate from the same CSV.

The caveat that the field is careful about

The numismatic literature is careful about something the agent-evaluation field should be careful about too: population methods do not replace connoisseurship. They build on it. Esty's geometric model is computable only if the die identifications going into it are correct. If two specimens are mis-attributed to the same die when they were struck by different dies, the singletons count is wrong, and the population estimate is garbage. The connoisseur's specimen-by-specimen judgment is the ground truth that the population methods amplify. Wrong ground truth produces wrong amplification.
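
A small sketch shows how one misattribution propagates. The labels are hypothetical, and Chao1 stands in for Esty's geometric model purely to keep the arithmetic short. Merging two dies that were each seen once turns two singletons into one doubleton, and the estimate drops sharply.

```python
from collections import Counter


def chao1_from_labels(labels):
    """Chao1 estimate computed directly from a list of class labels."""
    counts = Counter(labels).values()
    s_obs = len(counts)
    f1 = sum(1 for c in counts if c == 1)
    f2 = sum(1 for c in counts if c == 2)
    if f2 == 0:
        return s_obs + f1 * (f1 - 1) / 2  # bias-corrected fallback
    return s_obs + f1 * f1 / (2 * f2)


# Hypothetical correct die attributions for ten specimens.
correct = ["A", "A", "A", "B", "B", "C", "D", "E", "F", "F"]

# The same specimens, with die D wrongly identified as die C.
misattributed = ["C" if die == "D" else die for die in correct]

correct_estimate = chao1_from_labels(correct)        # 6 + 3*3 / (2*2) = 8.25
wrong_estimate = chao1_from_labels(misattributed)    # 5 + 1*1 / (2*3), about 5.17
print(f"correct: {correct_estimate:.2f}, after one misattribution: {wrong_estimate:.2f}")
```

One wrong identification out of ten specimens moves the estimate from 8.25 dies to about 5.17, which is why the specimen-level judgment stays load-bearing.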

This caveat is the part most worth preserving when porting the framework. Population methods are not a replacement for individual evaluation. They are an additional layer on top of individual evaluation, which makes claims about populations rather than specimens. If the underlying individual evaluations are unreliable, population estimates from them will be unreliable in predictable, characterizable ways. The connoisseur is still load-bearing.

Why agent evaluation is late connoisseurship

With this background in place, the diagnosis of the current state of agent evaluation is straightforward to state.

Most evaluation today is connoisseurship. A human evaluator (or an LLM judge standing in for one) applies a rubric to an individual output and produces a score. The rubric is the catalogue. The score is the type identification. The evaluator's expertise is the connoisseur's expertise. This work is valuable; the rubric is well-developed in some domains and underdeveloped in others, but the rubric-plus-evaluator setup is recognizable as connoisseurship in the numismatic sense.

What evaluation does not do at scale is population estimation. The dominant reporting convention in 2026 is single-number summaries: HumanEval pass@1 = 87%, MMLU = 76.4%, GSM8K = 92.1%. These are not population estimates. They are point estimates with no uncertainty bound, no coverage claim, no statement about what the test set is sampling from. They are the equivalent of a numismatist publishing “this coin is in Fine condition” — a true and useful claim about a specimen, presented as if it settled something about a population.
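
For contrast, attaching even a basic uncertainty bound to a pass rate takes only a few lines. The sketch below uses the Wilson score interval for a binomial proportion. The count of 143 passes is a made-up figure chosen to land near 87%; 164 is the size of the HumanEval problem set. The interval covers sampling noise over test items only and says nothing about what the test set samples from.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


passes, problems = 143, 164      # hypothetical result on a 164-problem benchmark
low, high = wilson_interval(passes, problems)
print(f"pass rate {passes / problems:.1%}, 95% interval {low:.1%} to {high:.1%}")
```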

A 2025 survey on LLM-agent evaluation (arXiv:2503.16416) names the problem in its own vocabulary: LLM agents are inherently stochastic, so the same prompt can produce different outputs, and multiple executions are needed to observe variation — but most evaluations report single runs. A follow-up in KDD 2025 (arXiv:2507.21504) documents substantial performance degradation when models move from domain-specific evaluations to general-agent settings, which is what you would expect if the eval was over-fitting to the specific specimens in the test set rather than sampling the underlying population of capabilities. A February 2026 paper (arXiv:2602.18998) reports that neither sequential nor parallel test-time scaling produces effective performance improvements in general-agent settings, which is consistent with the eval methodology breaking before the agent does.

These are the diagnostic signs of a field in late connoisseurship. The catalogue is well-developed enough that individual judgments are useful, but the infrastructure for population-level inference has not yet been built.

What a Chao1 estimator would say to you

Here is the most concrete and useful piece of what the pivot would deliver, with a worked example.

Suppose you run a thousand evaluation prompts against your agent and observe 150 distinct failure modes. Of those 150, eighty appeared only once (singletons), forty appeared twice (doubletons), and thirty appeared three or more times. The Chao1 estimator gives a lower bound on the total number of failure modes, including those your evaluation has not yet observed:

S_Chao1 = S_observed + f₁² / (2 · f₂)
       = 150 + 80² / (2 · 40)
       = 150 + 80
       = 230 estimated total failure modes

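The interpretation: your evaluation has covered approximately 150 / 230, or about 65%, of the estimated failure-mode space. Roughly eighty distinct failure modes are estimated to exist that none of your thousand prompts has yet triggered. Because Chao1 is a lower bound on the total, 65% is best read as an upper bound on coverage and eighty as a floor on what is missing.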
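
The arithmetic above can be reproduced in a few lines of Python. This is a minimal sketch of the classic Chao1 formula using the essay's own numbers; the function and variable names are illustrative.

```python
def chao1(s_observed, f1, f2):
    """Classic Chao1: observed classes plus f1^2 / (2 * f2).

    f1 is the number of classes seen exactly once, f2 exactly twice.
    """
    if f2 <= 0:
        raise ValueError("classic Chao1 needs at least one doubleton")
    return s_observed + f1 ** 2 / (2 * f2)


s_obs, f1, f2 = 150, 80, 40          # the worked example from the text

estimate = chao1(s_obs, f1, f2)      # 150 + 6400 / 80 = 230.0
coverage = s_obs / estimate          # 150 / 230, about 0.652
unseen = estimate - s_obs            # 80.0

print(f"estimated total: {estimate:.0f}")
print(f"estimated coverage: {coverage:.1%}")
print(f"estimated unseen failure modes: {unseen:.0f}")
```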

This single number (“your eval covers an estimated 65% of the failure-mode space”) tells you something no conventional benchmark score does: not just what you found, but how much you are probably missing. It gives you a target (“get to 85% coverage”) that depends far less on any individual run's lucky or unlucky variation. It gives you a stopping rule (“we are at 95% coverage with diminishing returns, time to ship”) that the current single-number-summary convention cannot give you. It transforms an evaluation from a verdict on a specimen into a population estimate.
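
A coverage figure is itself an estimate, so it should carry an interval. The sketch below uses the variance approximation and log-transformed interval usually attributed to Chao (1987) and found in common ecology tooling. Treat the formulas as an assumption to check against a reference implementation before relying on them.

```python
import math


def chao1_with_interval(s_obs, f1, f2, z=1.96):
    """Chao1 point estimate with an approximate log-normal confidence interval."""
    if f1 <= 0 or f2 <= 0:
        raise ValueError("this form needs at least one singleton and one doubleton")
    r = f1 / f2
    estimate = s_obs + f1 * f1 / (2 * f2)
    # Variance approximation for the classic (non bias-corrected) estimator.
    variance = f2 * (0.5 * r ** 2 + r ** 3 + 0.25 * r ** 4)
    # The interval is built on the unseen part, so the lower bound never
    # falls below the number of classes actually observed.
    t = estimate - s_obs
    k = math.exp(z * math.sqrt(math.log(1 + variance / t ** 2)))
    return estimate, s_obs + t / k, s_obs + t * k


s_obs = 150
estimate, lower, upper = chao1_with_interval(s_obs, f1=80, f2=40)
print(f"total failure modes: {estimate:.0f} (approx. 95% interval {lower:.0f} to {upper:.0f})")
print(f"coverage: {s_obs / estimate:.0%} (range {s_obs / upper:.0%} to {s_obs / lower:.0%})")
```

Under these formulas the worked example's 230 comes with an interval of roughly 195 to 291 failure modes, so a target like "85% coverage" is better checked against the pessimistic end of the range than against the point estimate.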

The Chao1 calculation took six seconds. The methodology is forty years old in ecology and twenty years old in numismatics. It is currently absent from essentially every published agent-evaluation report. This is the structural gap.

The singletons diagnostic for your sampling strategy

The same singletons signal drives a sampling-strategy diagnostic. If most observed failure modes appeared once in a thousand runs, your evaluation is in the “barely sampled” regime: diversify your prompts. If most failure modes appeared many times, you have saturated the common failures: shift to targeted adversarial generation for the long tail. As a rule of thumb, singletons making up 50% of the distinct failure modes observed is a reasonable handoff point. This is the same logic ecologists use to decide whether to keep surveying the same habitat or move to a different one, and the same logic numismatists use to decide whether to seek more specimens of a given type or shift to under-sampled types. It is the resource-allocation logic that current agent-evaluation work mostly does by intuition because the formal apparatus has not been ported.
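
The diagnostic is simple enough to sketch directly. The function name, the advice strings, and the synthetic labels below are all illustrative, and the 0.5 threshold is the rule of thumb from the text rather than a derived constant.

```python
from collections import Counter


def sampling_advice(failure_labels, threshold=0.5):
    """Return (singleton ratio, advice) for a list of observed failure-mode labels.

    The ratio is singletons divided by the number of distinct failure modes.
    """
    counts = Counter(failure_labels)
    if not counts:
        raise ValueError("no failures observed")
    singletons = sum(1 for c in counts.values() if c == 1)
    ratio = singletons / len(counts)
    if ratio >= threshold:
        return ratio, "diversify prompts"
    return ratio, "target the long tail"


# Synthetic labels matching the worked example: 80 modes seen once,
# 40 seen twice, 30 seen three times.
labels = (
    [f"single_{i}" for i in range(80)]
    + [f"double_{i}" for i in range(40) for _ in range(2)]
    + [f"triple_{i}" for i in range(30) for _ in range(3)]
)

ratio, advice = sampling_advice(labels)
print(f"singleton ratio {ratio:.2f}: {advice}")
```

With 80 singletons among 150 distinct modes the ratio is about 0.53, so the worked example sits just on the "diversify prompts" side of the threshold.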

What the infrastructure pivot would require

Looking at what enabled the numismatic pivot, four pieces of infrastructure are prerequisites for the equivalent pivot in agent evaluation.

The first is a capability and failure-mode taxonomy — the agent equivalent of Crawford's Roman Republican Coinage. Without an agreed catalogue, you cannot count “how many distinct failure modes did we observe,” because two prompts that hit the same underlying capability might be classified as different failure modes by one evaluator and the same by another. HELM, BIG-Bench, and MMLU are proto-catalogues; the agent-evaluation equivalent of Crawford has not yet been written.

The second is die-linking metadata — the ability to mark two prompt-output pairs as testing the same underlying capability. Current eval data treats each prompt as independent. Two prompts that both test mathematical reasoning in different surface forms should be linked. Without linkage, you cannot aggregate observations into capability-level counts.

The third is an open, machine-readable database at RRDP scale. Individual labs have proprietary eval data; the LMSYS Arena and the Open LLM Leaderboard are closest to public infrastructure, but their data formats are not standardized and the linkage metadata is absent. Three hundred thousand documented prompt-output specimens with linkage data and capability classification is a feasible infrastructure target for a coordinated community effort. Building it is the most consequential single thing the field could do.

The fourth is standard format exposure — the equivalent of the 2022 CSV moment. Eval databases that live behind APIs or in proprietary platforms suppress the kind of community-wide analytical work that the population pivot requires. Open formats are the precondition for population methods, not a nice-to-have.
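
To make the four prerequisites concrete, here is a minimal sketch of what a linked, machine-readable eval record might look like, and how it aggregates once exported. The field names, capability keys, and failure labels are all hypothetical. No such shared schema exists yet, which is the point of the section.

```python
import csv
import io
from collections import Counter
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvalRecord:
    prompt_id: str
    capability_id: str  # hypothetical taxonomy key, playing the role of a catalogue number
    failure_mode: str   # hypothetical label; an empty string means the run passed


records = [
    EvalRecord("p001", "math.arithmetic", "dropped_carry"),
    EvalRecord("p002", "math.arithmetic", "dropped_carry"),  # same mode, different surface form
    EvalRecord("p003", "math.arithmetic", ""),
    EvalRecord("p004", "tools.search", "stale_result"),
    EvalRecord("p005", "tools.search", "wrong_tool"),
]

# Export to CSV (an in-memory buffer stands in for a published file).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["prompt_id", "capability_id", "failure_mode"])
writer.writeheader()
writer.writerows(asdict(r) for r in records)

# Anyone holding the file can reload it and rebuild the counts.
buffer.seek(0)
loaded = [EvalRecord(**row) for row in csv.DictReader(buffer)]

# The linkage step: failures are counted per (capability, failure mode), not per prompt.
mode_counts = Counter((r.capability_id, r.failure_mode) for r in loaded if r.failure_mode)
freq_of_freq = Counter(mode_counts.values())  # maps k to the number of modes seen k times

print(f"distinct failure modes: {len(mode_counts)}")
print(f"singletons: {freq_of_freq[1]}, doubletons: {freq_of_freq[2]}")
```

The singleton and doubleton counts that fall out of the last step are exactly the inputs a Chao1-style estimate needs, which is why linkage and an open format are prerequisites rather than conveniences.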

The calendar

Numismatics took roughly thirty years to complete its pivot — call it 1965 to 1995, with the infrastructure work continuing afterward. The trigger was a combination of statistical maturity (Good–Turing → ecology → numismatics) and infrastructure maturity (computerized catalogues, photo databases, eventually digitized linked data).

Agent evaluation has the statistical maturity already. The same Good–Turing-derived estimators that work in numismatics work in agent evaluation. The Chao1 and Chao2 estimators from ecology, the unseen-species estimators from genomics, the coverage estimators from information theory — all of these are mature, validated, and trivially adaptable. What is missing is the infrastructure layer. The catalogue. The linkage metadata. The open database at scale.

The acceleration factor is that agent evaluation does not require physical specimen examination. Specimens (prompt-output pairs) can be generated at machine speed. The bottleneck is not data collection. It is the metadata that lets the data be aggregated into population claims. A coordinated effort to build the metadata infrastructure could plausibly compress thirty years of numismatic pivot into five.

The single move worth making on Monday, if you run any kind of agent evaluation: stop reporting your top-line benchmark scores without an accompanying coverage estimate. Even an estimate calculated from your singletons and doubletons in fifteen minutes is more informative than a point score alone. The Chao1 estimator is not exotic, not new, and not contested. Implementations are sitting on CRAN and PyPI waiting to be used. Numismatists were in the same position in 1995, looking at Esty's draft work, waiting for someone to build the infrastructure. By 2022 the infrastructure was built and the field was different. The infrastructure for agent evaluation is two or three years from being feasible. Whether it gets built is a community decision.

Your benchmark score is one specimen. The population is what matters.

A rating is a population estimate, not a point score.

The essay's argument is that an evaluation should report coverage and uncertainty, not a single number. That is also the design principle behind the Agent Rating Protocol: an agent's rating is an estimate over a population of observed interactions, carried with the sample it was drawn from, rather than a context-free point score. ARP gives ratings the die-linking metadata and open, machine-readable format the essay names as infrastructure prerequisites — signed, portable records of agent interactions that a Chao1-style coverage estimate can actually be computed against. The population pivot the essay predicts needs exactly this substrate to land on.

pip install agent-rating-protocol · npm install agent-rating-protocol