Your Eval Leaderboard Breeds Confident Liars

Meteorology fixed this in 1950. Criminal justice relearned it in 2016. Your eval harness is next.

Published June 2026 · 9 min read

In 1949, a weather forecaster could win at his own job by lying. Not crudely — just by saying "50% chance of rain" every single day. Over a year, he would be roughly as "accurate" as the colleague who actually stuck his neck out and called 90% on the days the sky opened and 10% on the clear ones. If you grade a forecaster only on whether it rained, the honest one and the hedger come out about even, and the hedger sleeps better. The system was rewarding cowardice and calling it skill.

A meteorologist named Glenn Brier found this intolerable, and in 1950 he published the fix in Monthly Weather Review, in a paper with the unglamorous title "Verification of Forecasts Expressed in Terms of Probability." It is three pages long. It quietly solved a problem that two later, richer, more computational fields would walk straight into anyway — criminal justice in 2016, and large language model evaluation right now. The fix is seventy-five years old. The fields that need it keep arriving late, paying full price for a lesson already on the shelf.

This is a story about how you grade a guess, and why the grade you choose silently rewrites the thing being graded.

The forecaster's lie

Brier's insight was that a probabilistic prediction has to be scored as a probability, not collapsed into a yes/no. His score is almost embarrassingly simple. For each forecast, take the probability you announced, f, and the outcome, o (1 if it rained, 0 if it didn't). Square the difference. Average over all your forecasts:

BS = (1/N) Σ (f_i − o_i)²

Lower is better; zero is a prophet. Say 90% on a day it rains and you eat a small penalty of 0.01; say 90% on a day it stays dry and you eat a brutal 0.81. The score punishes confident wrongness savagely and rewards confident rightness richly — but only if the confidence is earned.
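
A minimal sketch of the score in Python, to make the arithmetic concrete. The function name `brier_score` and the ten-day record are illustrative assumptions, not data from Brier's paper.

```python
def brier_score(forecasts, outcomes):
    """Mean squared difference between announced probabilities and 0/1 outcomes."""
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("need equal-length, non-empty sequences")
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)


# Hypothetical ten-day record: rain on the first five days, dry on the last five.
outcomes = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
hedger = [0.5] * 10               # says 50% every single day
caller = [0.9] * 5 + [0.1] * 5    # commits to 90% or 10%, and is right each time

print(brier_score(hedger, outcomes))  # 0.25
print(brier_score(caller, outcomes))  # about 0.01
```

On this toy record the hedger is stuck at 0.25 no matter what the weather does, while the forecaster who commits and is right lands near 0.01.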

What makes it more than a clever penalty is what happens when you decompose it. The Brier score splits cleanly into three parts: reliability (calibration: when you say 70%, does it happen 70% of the time?), resolution (sharpness: do your forecasts actually move away from the base rate, or do you just mumble the average?), and uncertainty (the irreducible noise of the world, which is nobody's fault). Reliability is a penalty that adds to your score, resolution is a credit that subtracts from it, and uncertainty is the same for every forecaster facing the same weather. The trap closes from both sides. Be wildly overconfident and your reliability term blows up. Hedge everything to "50%" and your resolution term collapses to zero, so you earn no credit at all. There is exactly one way to win: be honestly confident, sharp when you have signal and humble when you don't.
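
The decomposition is usually written as BS = reliability − resolution + uncertainty, a form generally credited to Allan Murphy's later work rather than to Brier's original paper. The sketch below assumes forecasts are grouped by their exact announced value, which makes the identity hold exactly. The function name and the three toy forecasters are assumptions for illustration.

```python
from collections import defaultdict


def brier_decomposition(forecasts, outcomes):
    """Return (reliability, resolution, uncertainty), grouping by distinct forecast value."""
    n = len(forecasts)
    base_rate = sum(outcomes) / n
    groups = defaultdict(list)
    for f, o in zip(forecasts, outcomes):
        groups[f].append(o)  # exact-value grouping; real data would usually be binned
    reliability = sum(
        len(obs) * (f - sum(obs) / len(obs)) ** 2 for f, obs in groups.items()
    ) / n
    resolution = sum(
        len(obs) * (sum(obs) / len(obs) - base_rate) ** 2 for obs in groups.values()
    ) / n
    uncertainty = base_rate * (1 - base_rate)
    return reliability, resolution, uncertainty


# Hypothetical record: rain on 4 of the first 5 days and on 1 of the last 5.
outcomes = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
forecasters = {
    "honest": [0.8] * 5 + [0.2] * 5,         # calibrated and sharp
    "hedger": [0.5] * 10,                    # calibrated, zero resolution
    "overconfident": [1.0] * 5 + [0.0] * 5,  # sharp, but reliability suffers
}
for name, forecasts in forecasters.items():
    rel, res, unc = brier_decomposition(forecasts, outcomes)
    print(name, round(rel, 4), round(res, 4), round(unc, 4), round(rel - res + unc, 4))
```

On this toy data the honest forecaster has the lowest total: the hedger pays for zero resolution, and the overconfident one pays through the reliability term.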

That property has a formal name. The Brier score is a strictly proper scoring rule: your expected score is minimized, uniquely, by reporting what you actually believe. If your true belief is 0.7, then announcing 0.7 beats announcing 0.9 (overconfident) and beats announcing 0.5 (chicken). Honesty isn't morally encouraged; it's mathematically optimal. Log-loss and the spherical score share this virtue. And here is the load-bearing fact for everything that follows: plain accuracy — did the top guess match the outcome? — is not a proper scoring rule. It grades only the argmax and throws the probability away. It cannot tell the difference between "95% sure and wrong" and "51% sure and wrong." That blind spot is small in a weather office. It is catastrophic when the prediction decides who stays in jail, or which model a million developers trust.
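
The properness claim can be checked in one line of algebra. Here f and o are the forecast and outcome from the formula above, and p is an added symbol for the forecaster's true belief that o = 1:

```latex
\mathbb{E}\big[(f - o)^2\big]
  = p\,(f - 1)^2 + (1 - p)\,f^2
  = p\,(1 - p) + (f - p)^2
```

The first term does not depend on what you announce, and the second is zero only when f = p. With p = 0.7, announcing 0.7 gives an expected score of 0.21, while announcing 0.9 or 0.5 each gives 0.25. Accuracy has no such minimum: if you believe rain is more likely than not, your expected accuracy for calling "rain" is p whether you announce 0.51 or 0.99.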

The judge's dilemma

Fast-forward sixty-six years. Northpointe's COMPAS tool scores criminal defendants on a 1-to-10 scale for the risk that they will reoffend, and judges across the United States read those scores at bail and sentencing. In May 2016, ProPublica published "Machine Bias," an investigation of more than 10,000 defendants in Broward County, Florida. The overall accuracy of COMPAS was about 61%: a little better than a coin, in the way these tools usually are. But the errors were not shared evenly. Black defendants who did not go on to reoffend were flagged as future criminals at nearly twice the rate of white defendants: a false-positive rate of about 45% versus 23%. Meanwhile white defendants who did go on to reoffend were far more often rated low-risk, with a 48% false-negative rate against 28% for Black defendants. The mistakes had a direction, and the direction had a color.

Northpointe fired back with a claim that sounds like a flat contradiction but isn't: COMPAS was calibrated. Among everyone the tool labeled high-risk, the actual reoffense rate was about the same regardless of race. A "7" meant the same thing for a Black defendant as for a white one. That is predictive parity, and it was true.

So who was right? Both of them — and this is the part that should stop you cold, because it is a theorem, not a debate. In 2017 Alexandra Chouldechova proved that when the underlying base rates differ between two groups (in this data, recidivism ran around 52% for Black defendants and 39% for white ones), you cannot simultaneously have equal false-positive rates, equal false-negative rates, and calibration. Pick any two; the third breaks. The unfairness ProPublica measured and the calibration Northpointe measured are not a disagreement about facts. They are two corners of a triangle you are mathematically forbidden from squaring.
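
The arithmetic behind the impossibility fits in a few lines. From the definitions of the confusion-matrix rates, a group's false-positive rate is pinned down by its base rate, its positive predictive value (PPV), and its false-negative rate: FPR = base/(1 − base) × (1 − PPV)/PPV × (1 − FNR). The sketch below uses the base rates quoted above. The shared PPV and FNR values are illustrative assumptions, not COMPAS measurements.

```python
def implied_false_positive_rate(base_rate, ppv, fnr):
    """False-positive rate forced by a group's base rate once PPV and FNR are fixed."""
    return base_rate / (1 - base_rate) * (1 - ppv) / ppv * (1 - fnr)


# Assumed for illustration: both groups get the same PPV and the same FNR.
shared_ppv, shared_fnr = 0.60, 0.35

fpr_black = implied_false_positive_rate(0.52, shared_ppv, shared_fnr)
fpr_white = implied_false_positive_rate(0.39, shared_ppv, shared_fnr)
print(round(fpr_black, 3), round(fpr_white, 3))  # about 0.469 and 0.277
```

Hold predictive parity and the false-negative rate equal across groups, and the different base rates force the false-positive rates apart. No amount of tuning closes that gap without giving up one of the other two.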

Now connect it to Brier. COMPAS was built and defended on accuracy (does the prediction match the outcome?), the exact metric Brier showed was insufficient in 1950. Accuracy is a single number that can be 61% overall while hiding an error story that differs violently across subgroups. Had COMPAS been scored from the start with a proper rule, a Brier score decomposed by group with the reliability term and the error rates computed for Black and white defendants separately, the disparity would have been a line on a chart before the tool ever touched a courtroom, not a journalism exposé four years and thousands of bail decisions later. The instrument that would have caught it was sitting in the meteorology literature, sixty-six years old, fully proven. Nobody reached for it, because risk tools are graded the way everything is graded: how often were you right?

The leaderboard's incentive

Which brings us to the machine on your desk, and the scoreboard it was raised on.

Open any major LLM leaderboard, or the benchmarks that feed it (MMLU, GPQA, HumanEval), and you are looking at accuracy. Did the model produce the right answer? That is 0/1 loss, the same non-proper metric, scaled to billions of parameters. Run the consequences forward and they are grimly familiar. A model that says "I'm 95% sure it's B" and is wrong scores identically to a model that says "I'm 51% sure it's B" and is wrong; the leaderboard cannot see the difference and does not care. A model that honestly abstains ("I don't know this one") scores worse than a model that guesses blindly on a four-option question and gets lucky a quarter of the time. On an accuracy leaderboard, "I don't know" is the most expensive sentence a model can utter. The incentive gradient points, with mathematical certainty, away from calibration and toward bluster.
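
A small sketch of that blind spot, assuming each eval record is a hypothetical (stated confidence, correct) pair. Neither function is taken from any real leaderboard's harness.

```python
def accuracy(records):
    """Fraction of answers that were correct; stated confidence is ignored entirely."""
    return sum(correct for _, correct in records) / len(records)


def brier(records):
    """Mean squared gap between stated confidence and the 0/1 correctness of the answer."""
    return sum((conf - correct) ** 2 for conf, correct in records) / len(records)


# Each record: (confidence in the chosen answer, 1 if that answer was right else 0).
blusterer = [(0.95, 0)]  # "I'm 95% sure it's B", and it is not B
doubter = [(0.51, 0)]    # "I'm 51% sure it's B", and it is not B

print(accuracy(blusterer), accuracy(doubter))  # 0.0 0.0
print(brier(blusterer), brier(doubter))        # about 0.9025 and 0.2601

# On a four-option question, accuracy pays a blind guess 1/4 in expectation
# and pays an honest abstention nothing.
expected_blind_guess = 1 / 4
expected_abstention = 0.0
print(expected_blind_guess > expected_abstention)  # True
```

Accuracy cannot separate the two wrong answers. The proper score charges the blusterer more than three times as much.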

Then we pour fuel on it. Reinforcement learning from human feedback — the RLHF step that turns a raw model into a helpful assistant — has a documented bias: reward models tend to assign high scores to confident-sounding responses regardless of whether they're any good. The 2024 paper "Taming Overconfidence in LLMs" names it directly. The mechanism is depressingly human. Human raters, shown a hedged answer and a crisp confident one, tend to prefer the confident one. The reward model learns the correlation: confidence equals reward. PPO then optimizes the policy to maximize reward, which means optimizing it to sound sure. The end product is a system that has been trained, gradient by gradient, to misrepresent its own uncertainty — to say "definitely" when it means "probably," and "probably" when it means "I'm guessing." We didn't just fail to penalize the confident lie. We taught it.

It is the COMPAS pathology, transposed one more time into a new key: optimize the wrong metric — accuracy, or human preference for confidence — and you get a system that is impressively right on average and quietly, dangerously miscalibrated where it counts.

The identical fix

Here is the table that should be hanging in every eval team's office:

Meteorology (1950)	Criminal justice (2016)	LLM evaluation (now)
Weather forecast	Recidivism risk score	Model confidence
"30% chance of rain"	"High risk"	"I'm 95% sure"
Graded on: did it rain?	Graded on: did they reoffend?	Graded on: was the answer right?
Failure: forecaster always says 50%	Failure: miscalibrated across race	Failure: model always says 95%
Fix: Brier score (proper)	Fix: Brier, decomposed by group	Fix: RL with a proper scoring rule
Result: honest uncertainty	Result: a visible fairness gap	Result: calibrated confidence

Read across the rows and the fix never changes. Stop grading the guess on whether it happened to be right, and start grading the probability with a rule that prices calibration and sharpness together. The remedy meteorology adopted in 1950 is, line for line, the remedy criminal justice needed in 2016 and the remedy LLM evaluation needs today.

And it is already working where people have tried it. Leng and colleagues' 2024 work on reward calibration, the same "Taming Overconfidence in LLMs" paper, modifies the PPO step itself: their PPO-M and PPO-C variants fold calibration into the reward signal and cut calibration error while holding performance steady on Llama3-8B and Mistral-7B. The 2025 paper "Rewarding Doubt" goes the whole way and rewards the model during reinforcement learning with a proper scoring rule directly (a log score, or a tokenized Brier score) and reports that it "provably aligns expressed confidence with empirical accuracy," reaching state-of-the-art calibration that holds up even out of distribution. Strip away the modern apparatus and the move is exactly Brier's: replace the improper grade with a proper one, and watch honesty become the optimal strategy instead of a competitive disadvantage.
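
To see why a proper-score reward pulls stated confidence toward actual accuracy, here is a minimal sketch of a log-score reward. It is not the implementation from either paper: the function names, the clipping constant `eps`, and the grid search are assumptions made for illustration.

```python
import math


def log_score_reward(confidence, correct, eps=1e-6):
    """Reward is the log of the probability the model assigned to what actually happened."""
    c = min(max(confidence, eps), 1 - eps)  # clip so log(0) never occurs
    return math.log(c) if correct else math.log(1 - c)


def expected_reward(confidence, true_accuracy):
    """Average reward when the answer is right with probability true_accuracy."""
    return (true_accuracy * log_score_reward(confidence, True)
            + (1 - true_accuracy) * log_score_reward(confidence, False))


# A model that is right 70% of the time earns the most by saying 70%.
candidates = [i / 100 for i in range(1, 100)]
best = max(candidates, key=lambda c: expected_reward(c, 0.7))
print(best)  # 0.7
```

Under this reward, claiming 95% when you are right 70% of the time lowers the expected return. A reward that only tracks how sure the answer sounds has no such penalty.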

Why it hasn't happened, and what to do Monday

If the fix is seventy-five years old and a near-trivial change (swapping accuracy for a Brier or log score in an eval harness is a few lines of code), the obvious question is why every leaderboard isn't already proper. The barrier isn't technical. It's cultural, and it's almost petty: accuracy makes a great headline and a proper score doesn't. "This model scored 87%" is a number a board member, a journalist, and a Twitter thread all understand instantly. "This model scored 0.14 on Brier" means nothing to anyone who hasn't been told that lower is better and 0 is perfect. We keep the vanity metric because it's legible, and we pay for that legibility in confident machines that don't flinch when they're wrong.

So here is the practical move, in the order I'd do it. First, report a proper score next to your accuracy number — Brier or log-loss — on every eval you run. You don't have to dethrone accuracy overnight; you have to stop letting it be the only thing in the room. Second, let your model express a probability and let it abstain, and score the abstention honestly so that a calibrated "I don't know" is rewarded rather than punished. Third, and this is the COMPAS lesson written in blood: decompose your calibration by slice — by subgroup, by topic, by difficulty — before deployment, because a flattering aggregate score is exactly where catastrophic per-slice miscalibration hides. A model that is beautifully calibrated overall can be a confident liar on the one slice that matters in production, and a single accuracy number will never tell you.
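
The first and third steps can be sketched together: one report that puts Brier and log-loss next to accuracy, broken out by slice. The record layout, the slice names, and the reserved "ALL" key are assumptions for illustration, not any particular harness's format.

```python
import math
from collections import defaultdict


def slice_report(records, eps=1e-12):
    """Per-slice accuracy, Brier score, and log-loss from (slice, confidence, correct) records."""
    by_slice = defaultdict(list)
    for slice_name, confidence, correct in records:
        by_slice[slice_name].append((confidence, correct))
        by_slice["ALL"].append((confidence, correct))  # aggregate row; "ALL" is reserved
    report = {}
    for name, rows in by_slice.items():
        n = len(rows)
        report[name] = {
            "n": n,
            "accuracy": sum(c for _, c in rows) / n,
            "brier": sum((p - c) ** 2 for p, c in rows) / n,
            "log_loss": -sum(math.log(max(eps, p if c else 1 - p)) for p, c in rows) / n,
        }
    return report


# Hypothetical eval: the model always says 90%. It is right on all 18 "common"
# questions and wrong on both "rare" ones, so it is 90% accurate overall.
records = [("common", 0.9, 1)] * 18 + [("rare", 0.9, 0)] * 2
report = slice_report(records)
for name, row in report.items():
    print(name, {key: round(value, 3) for key, value in row.items()})
```

In this toy case the aggregate looks perfectly calibrated (90% confidence, 90% accuracy), while the rare slice carries a Brier score of about 0.81. That is the per-slice failure a single accuracy number hides.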

A 1949 forecaster, a 2016 judge, and a 2024 leaderboard all made the same mistake: they graded a probability as if it were a fact, and got back systems optimized to sound certain rather than to be honest. Glenn Brier handed us the answer on three pages in 1950. The real question facing anyone who ships or trusts a model today isn't "can we score calibration?" We've known how for three-quarters of a century. The question is why we're still pretending a leaderboard that rewards confident guessing is measuring intelligence — and how many more confident liars we'll train before we stop.

An agent's reputation is a forecast. Grade it like one.

If you rank or route agents by a single accuracy-style score, you are running the 1949 weather office: you reward the one that sounds sure and punish the one that honestly says "I don't know." The Agent Rating Protocol is built on proper scoring — portable, calibration-aware reputation that prices an agent's confidence against what actually happened, and decomposes it by slice so the confident liar can't hide behind a flattering average.

pip install agent-rating-protocol · npm install agent-rating-protocol
vibeagentmaking.com
