One number determines whether your evaluation panel helps or degrades. The math was settled before the French Revolution.
In the summer of 1785, Marie Jean Antoine Nicolas de Caritat — the Marquis de Condorcet — published an essay with a title so long it practically constituted an abstract: Essai sur l’application de l’analyse à la probabilité des décisions rendues à la pluralité des voix. Buried inside was a theorem about majority voting that splits cleanly into two results. The first result is famous. The second one should keep anyone deploying a multi-agent evaluation panel up at night.
The optimistic half: if each juror independently has probability p > 0.5 of choosing correctly on a binary decision, then the probability that a majority votes correctly increases monotonically with the number of jurors, approaching certainty as the panel grows. This is the mathematical backbone of “wisdom of crowds” — more independent, modestly competent voters produce better collective decisions.
The devastating inverse: if p < 0.5, the majority probability is strictly decreasing in panel size and approaches zero as the panel grows. Each additional juror shifts more probability mass toward the wrong answer. The group's error rate doesn't just creep upward; it converges toward certainty exponentially fast with every voice you add. The optimal jury size, when individual accuracy falls below 50%, is exactly one.1,6,7
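Both halves of the theorem fall out of a single expression. For an odd panel of $n$ independent jurors, each correct with probability $p$, the majority votes correctly with probability

$$
P_{\mathrm{maj}}(n, p) \;=\; \sum_{k=(n+1)/2}^{n} \binom{n}{k}\, p^{k} (1-p)^{n-k},
$$

which is increasing in $n$ toward 1 when $p > \tfrac{1}{2}$ and decreasing toward 0 when $p < \tfrac{1}{2}$. (This is a standard modern statement of the result, not Condorcet's original notation.)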
One number — your agent’s task-specific accuracy — determines which side of the theorem you’re on. There is no “it depends on the architecture.” There is no “it helps with diversity.” Below 0.50, more agents means more wrong. The proof is a direct summation of binomial probabilities. It was settled before the French Revolution.
The multi-agent evaluation industry runs on an assumption that sounds so reasonable it barely needs defending: more judges produce better judgments. Three LLM evaluators must be better than one. Five must be better than three. Put a panel together, take the majority vote, and the errors will cancel out.
Condorcet showed this is exactly half right. The errors cancel out only when each judge is individually more accurate than a coin flip on the specific task being judged. That specificity matters — you can’t ask “are LLMs good judges?” and get a useful answer. You have to ask “are LLMs good judges of this particular thing?”
JudgeBench, a benchmark for evaluating LLM-based judges presented at ICLR 2025, provides the answer at task-level granularity. When GPT-4o evaluated correctness-grounded pairs using vanilla prompting, its accuracy varied dramatically by domain:3
| Domain | GPT-4o Accuracy |
|---|---|
| Knowledge | 44.2% |
| Reasoning | 48.0% |
| Math | 66.1% |
| Coding | 61.9% |
Math and coding sit comfortably above 50% — panels help there. Knowledge sits at 44.2% and reasoning at 48.0%. Both are below the Condorcet threshold. On these task classes, adding more LLM judges to a panel mathematically decreases accuracy.
Separately, the SAGE framework confirmed the pattern from a different angle: across all evaluated models, evaluation consistency between direct scoring and pairwise comparison collapsed from 78–84% alignment on easy tasks to 32–43% on hard ones.4 A panel that helps on easy discriminations — where your judges are likely above 50% — actively hurts on hard ones, where accuracy drops below the threshold. You’re improving decisions you’d probably get right anyway while degrading the decisions where accuracy matters most.
Take GPT-4o’s 44.2% accuracy on knowledge evaluation tasks and compute the majority-correct probability for panels of increasing size. The binomial math is mechanical:
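A minimal sketch of that summation in pure Python, using 0.442 from the JudgeBench knowledge row above:

```python
from math import comb

def majority_correct(p: float, n: int) -> float:
    """Probability that a majority of n independent judges,
    each correct with probability p, votes correctly (odd n)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

for n in (1, 3, 5, 7):
    print(n, round(majority_correct(0.442, n), 3))
# 1 judge: 0.442 -> 3 judges: 0.413 -> 5 judges: 0.392 -> 7 judges: 0.375
```

Every step down the panel-size ladder loses about two points of accuracy, exactly as the pessimistic half of the theorem predicts.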
Each judge you add makes the panel worse. Not because 44.2% represents random guessing — it reflects the genuine difficulty of knowledge-domain discrimination — but because the judges are on the wrong side of the threshold. The binomial distribution doesn’t care about your intentions. It doesn’t care about your architecture. It cares about one number.
This is the same math that governs electoral theory, which is where the political science literature becomes unexpectedly relevant. The Stanford Encyclopedia of Philosophy’s entry on jury theorems notes that complex decisions — where voters face multidimensional tradeoffs without clear expertise — push individual competence toward or below 0.50. Rational ignorance, a phenomenon first formalized by Anthony Downs in 1957, means voters minimize information costs when their individual vote has negligible impact on the outcome, further eroding competence.8,9
The LLM judge evaluating a subtle quality difference between two near-identical outputs is in exactly the same position as a voter choosing between two complex policy bundles. The information needed for a correct judgment exceeds the information available in the prompt. The judge is making its best guess — and on hard discriminations, its best guess is wrong more often than it’s right.
Condorcet’s theorem requires two conditions: individual competence above 50%, and statistical independence between voters. LLM agent panels violate both simultaneously.
In 2024, Araújo et al. directly tested whether LLM ensembles satisfy the independence assumption. They evaluated GPT-4, GPT-3.5, DistilRoBERTa, and FinBERT on financial sentiment classification, checking all four Condorcet conditions explicitly. Three held — identical distribution, performance above random, and uniform error distribution. Independence failed. The authors concluded that “advanced LLMs like GPT-4 demonstrate significant overlap in the decision-making processes” with smaller models.2 The models aren’t independent voters. They’re correlated jurors who tend to be wrong about the same things.
This matters because independence is what makes crowds wise. Scott Page’s diversity prediction theorem formalizes it: collective error equals average individual error minus prediction diversity.10 When voters are correlated, prediction diversity collapses and the crowd offers no advantage over its average member. A panel of five correlated judges might have the effective diversity of one and a half independent judges — all the coordination cost, none of the statistical benefit.
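Page's result is an algebraic identity, so it can be checked numerically. A toy sketch with hypothetical predictions and the squared-error convention the theorem uses:

```python
# Diversity prediction theorem: for predictions s_i of a true value theta,
# with crowd prediction c = mean(s_i):
#   (c - theta)^2 = mean((s_i - theta)^2) - mean((s_i - c)^2)
# i.e. collective error = average individual error - prediction diversity.

def diversity_decomposition(predictions, truth):
    n = len(predictions)
    c = sum(predictions) / n                       # crowd (mean) prediction
    collective = (c - truth) ** 2                  # collective squared error
    avg_individual = sum((s - truth) ** 2 for s in predictions) / n
    diversity = sum((s - c) ** 2 for s in predictions) / n
    return collective, avg_individual, diversity

collective, avg_ind, div = diversity_decomposition([2.0, 4.0, 9.0], truth=5.0)
# The identity holds exactly: collective == avg_ind - div.
# When judges are correlated their predictions cluster, diversity shrinks,
# and collective error climbs back toward the average individual's error.
```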
The information-theoretic case arrives at the same conclusion from a different direction entirely. Tran and Kiela showed that splitting a task across agents triggers the Data Processing Inequality: conditioning on messages leaves more residual uncertainty about the correct answer than conditioning on the full context.5 When agents communicate through compressed messages — as they must in any multi-agent architecture — information is necessarily lost. A single agent retaining full context has a theoretical advantage that no amount of architectural cleverness can eliminate.
Their experimental results confirmed the theory. On multi-hop reasoning tasks at equal thinking-token budgets, single agents consistently outperformed multi-agent systems across five architectures: sequential, subtask-parallel, parallel-roles, debate, and ensemble. Multi-agent systems became competitive only when single-agent context utilization was deliberately degraded with heavy masking or substitution noise. The multi-agent structure helps only when the single agent is artificially handicapped.
Condorcet says that below 50%, more agents make the group worse. The Data Processing Inequality says that splitting the problem makes each agent worse individually. Both forces compound. Neither requires the other to operate.
There is an irony here. The machine learning community solved this problem decades ago.
Random forests work precisely because they engineer both Condorcet conditions. Each tree is trained on a bootstrap sample of the data — a random subset with replacement — ensuring partial decorrelation between trees. Each tree considers only a random subset of features at every split, injecting diversity in what the trees “see.” The algorithm doesn’t just add more trees. It actively decorrelates them. Leo Breiman understood in 2001 what many agent architects still haven’t internalized: the value of an ensemble comes from the independence of its members, not from their number.11
When independence holds and each tree is above 50% accurate, adding trees increases forest accuracy toward certainty — Condorcet’s optimistic case playing out in a classifier. When trees are correlated, adding them yields diminishing returns. The Condorcet threshold is the mathematical reason bagging works, even though Breiman didn’t frame it that way.
The agent evaluation community skips both decorrelation steps. Multi-agent panels typically deploy multiple instances of the same model (no architectural diversity), prompted with the same or similar instructions (no information diversity), evaluating outputs they're all equally uncertain about (no competence advantage). The result is a jury of correlated jurors, each performing below 50% on hard tasks, aggregated by majority vote. Condorcet's 1785 proof tells you exactly what happens next.
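A small Monte Carlo sketch makes the correlated-juror failure concrete even when judges are above the threshold. The correlation model here (with probability rho, all judges copy one shared draw) is an illustrative assumption, not a claim about any specific panel:

```python
import random

def panel_accuracy(p, n, rho, trials=200_000, seed=0):
    """Majority-vote accuracy for n judges, each correct with probability p.
    With probability rho, all judges copy a single shared draw (fully
    correlated); otherwise each judge votes independently."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        if rng.random() < rho:
            votes = [rng.random() < p] * n          # one draw, copied n times
        else:
            votes = [rng.random() < p for _ in range(n)]
        correct += sum(votes) > n / 2
    return correct / trials

independent = panel_accuracy(p=0.60, n=5, rho=0.0)   # ~0.68: ensemble gain
correlated  = panel_accuracy(p=0.60, n=5, rho=0.7)   # ~0.62: gain mostly erased
```

With five judges at 60% individual accuracy, independence buys roughly eight points of panel accuracy; at 70% correlation most of that gain evaporates, which is the random-forest lesson restated.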
The practical prescription follows directly from the math.
Before deploying a multi-agent evaluation panel, measure your agent’s task-specific accuracy on a representative sample of the class of judgments you need. This is the only step that matters, and almost nobody does it.
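Measuring p amounts to scoring the judge against a labeled sample and attaching an honest confidence interval. A sketch using the standard Wilson score interval (the function name and sample counts are illustrative):

```python
from math import sqrt

def wilson_interval(correct: int, total: int, z: float = 1.96):
    """95% Wilson score interval for a judge's task-specific accuracy."""
    phat = correct / total
    denom = 1 + z * z / total
    center = (phat + z * z / (2 * total)) / denom
    half = z * sqrt(phat * (1 - phat) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half

# 22/50 correct: the interval straddles 0.5 -- too few samples to know
# which side of the Condorcet threshold this judge is on.
print(wilson_interval(22, 50))     # ≈ (0.31, 0.58)

# 442/1000 correct: the entire interval sits below 0.5 -- use one judge.
print(wilson_interval(442, 1000))  # ≈ (0.41, 0.47)
```

The sample size matters as much as the point estimate: a few dozen labeled judgments usually cannot place a judge on either side of 0.5 with any confidence.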
If p > 0.5 and agents are genuinely independent, a panel helps — add judges. If p > 0.5 but agents are correlated, decorrelate before scaling: different models, different prompting strategies, different information subsets — like a Random Forest decorrelates its trees with bootstrap sampling and feature randomization. If p < 0.5, use one agent. Not a panel of one. One judge, no aggregation overhead, no coordination cost, no information loss from message compression. The money and compute you save on the other four judges? Spend it making the one judge better. Fine-tune it. Give it better context. Increase its thinking budget. A single GPT-4o at 44.2% is strictly better than three GPT-4o instances at 41.3%.
The JudgeBench data reveals something else worth internalizing: the threshold varies by domain within the same model. GPT-4o is in Condorcet’s good zone for math (66.1%) and coding (61.9%), and in the bad zone for knowledge (44.2%) and reasoning (48.0%). A team using the same panel architecture for all evaluation tasks is simultaneously improving its math judgments and degrading its knowledge judgments — and probably doesn’t know which is which, because nobody measured p.
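Running the binomial sum over the JudgeBench figures shows how sharply the same architecture diverges by domain. A sketch for a hypothetical five-judge panel, assuming independent judges (a generous assumption; correlated real panels do worse):

```python
from math import comb

def majority_correct(p, n=5):
    """Majority-vote accuracy of n independent judges with accuracy p."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

domains = {"knowledge": 0.442, "reasoning": 0.480,
           "math": 0.661, "coding": 0.619}
for name, p in domains.items():
    print(f"{name}: single {p:.1%} -> 5-judge panel {majority_correct(p):.1%}")
# knowledge and reasoning degrade (to 39.2% and 46.3%);
# math and coding improve (to 78.2% and 71.5%).
```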
In 1785, Condorcet wasn’t thinking about language models. He was thinking about juries, parliaments, and the conditions under which democratic bodies could reliably discover truth. His answer was a conditional: yes, but only when the voters can see clearly enough to be right more often than they’re wrong.
Two hundred and forty-one years later, the technology industry is scaling agent evaluation panels on the unexamined assumption that more eyes produce better judgments. Condorcet proved that this is true only when each eye sees clearly — when individual accuracy exceeds the coin-flip threshold on the specific judgment being asked. On the hardest evaluations — knowledge discrimination, reasoning assessment, the subtle qualitative comparisons where automated evaluation matters most — each eye is blurry. A majority vote over three blurry judges isn't three times as trustworthy as one; at 41.3%, it's measurably less trustworthy than a single judge at 44.2%, and the gap widens with every judge you add.
The fix isn’t more agents. It’s one better agent — or, failing that, agents that are wrong about different things.
The theorem reduces multi-agent evaluation to one number. Measure it.
Condorcet’s proof says one thing matters: p, your agent’s accuracy on the specific task class. Below 0.50, every additional judge degrades the panel. Above 0.50, every judge improves it. The entire multi-agent evaluation question collapses to whether you’ve measured that number. Agent Rating Protocol provides the infrastructure — per-task accuracy recorded as signed, portable ratings anchored to the specific domain and judgment class. When you need to know whether p exceeds the Condorcet threshold for knowledge evaluation, the answer comes from measurement infrastructure, not from the assumption that more judges must be better.
pip install agent-rating-protocol · npm install agent-rating-protocol
See a verified agent rating →