Aa1 Is Not One Unit Better Than Aa2: The Ordinal-Scale Error in Every Leaderboard

Moody's will tell you in writing that Aa1 minus Aa2 is undefined. Your ML leaderboard computes it to one decimal place and ships.

Published June 2026 · 10 min read

Moody's rates corporate bonds on a 21-notch scale: Aaa at the top, then Aa1, Aa2, Aa3, A1, and on down through Caa and Ca to C at the bottom. It looks exactly like a ruler. There are even little numeric modifiers bolted onto the letters (Aa1, Aa2, Aa3), practically daring you to do arithmetic. So here is a question that sounds trivial and isn't: how much better is Aa1 than Aa2?

The honest answer, which Moody's itself will tell you in writing, is that the question is malformed. Aa1 is not "one unit" better than Aa2, because there is no unit. The scale measures order, not distance. And you can prove the gaps are unequal from the agency's own data: in Moody's Idealized Default Rates, the difference in expected default probability between adjacent top notches is almost nothing (Aaa and Aa1 are, in any practical sense, both "won't default") while the difference between two adjacent notches down in the B and Caa range is enormous, the gap between "probably fine" and "actively dying." The same one-notch step is worth a rounding error at the top and a catastrophe at the bottom. The "unit" is a fiction.

I find this delightful, because the institution that looks most like it's wielding a precision instrument is the one that insists hardest, and most correctly, that it is not. And once you see that discipline clearly, you start noticing its absence everywhere else, most glaringly in the place that prides itself on rigor: the machine-learning leaderboard. The bond raters, it turns out, are more statistically disciplined about their scores than your AI dashboard is about its own.

The law that named the mistake in 1946

The error has a name, and it's old. In 1946 the psychologist S. S. Stevens published a short paper, "On the Theory of Scales of Measurement," that sorted all measurement into four levels: nominal (categories with no order: blood types, programming languages), ordinal (ranks with order but no fixed spacing: finishing positions, Likert "agree/strongly agree," bond notches), interval (equal spacing but no true zero: Celsius), and ratio (equal spacing and a true zero, so ratios are meaningful: length, mass, dollars).

The whole point of the taxonomy is that the level dictates which arithmetic is legal. With ordinal data, Stevens argued, you may report the median and the mode; you may not legitimately compute a mean or a standard deviation, because the mean assumes the distance between rank 1 and 2 equals the distance between rank 2 and 3, and for ordinal data, that assumption is simply false. Dispersion is described with percentiles and quartiles, not SD. "First, second, third" tells you the order of finish; it does not tell you the winner beat the runner-up by an inch or a mile.
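A minimal sketch of why the mean is fragile on ordinal data. The groups, the five labels, and both numeric codings are made up for illustration; the only thing the two codings share is that they preserve the order of the labels.

```python
from statistics import mean, median

# Two hypothetical numeric codings of the same five ordered labels.
# Both respect the order; only the spacing differs.
EQUAL_STEPS = {"poor": 1, "fair": 2, "good": 3, "great": 4, "superb": 5}
STRETCHED_TOP = {"poor": 1, "fair": 2, "good": 3, "great": 4, "superb": 10}

# Made-up ratings for two groups (odd length, so the median is an actual rating).
group_a = ["poor", "poor", "superb", "superb", "superb"]
group_b = ["good", "good", "great", "great", "great"]


def summarize(ratings, coding):
    """Return (mean, median) of the ratings under the given numeric coding."""
    values = [coding[r] for r in ratings]
    return mean(values), median(values)


for name, coding in [("equal steps", EQUAL_STEPS), ("stretched top", STRETCHED_TOP)]:
    mean_a, median_a = summarize(group_a, coding)
    mean_b, median_b = summarize(group_b, coding)
    print(f"{name}: mean says A > B? {mean_a > mean_b}; "
          f"median says A > B? {median_a > median_b}")
```

Under the equally spaced coding the mean puts group B ahead; stretch the top step and the mean puts group A ahead. The median comparison gives the same answer under both codings, because it depends only on order.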

This is Stats 101. It is taught in every introductory methods course. And two large, sophisticated, money-soaked industries do diametrically opposite things with it: one has built its entire credibility on obeying it, and the other breaks it on the front page of its results every week.

The domain that got it right: credit ratings

Credit rating is, at bottom, an exercise in not fooling yourself with a number, and the field has built structural guardrails to enforce that.

They state the scale type explicitly. Moody's is unambiguous that its ratings "measure ordinal credit risk, not cardinal," that they are "opinions of ordinal, horizon-free credit risk," and (this is the line engineers should tattoo somewhere) that they "do not target specific default rates or expected loss rates." The rating is a rank of relative risk meant to be consistent across thousands of issuers, not a measurement you can subtract.

They keep separate constructs on separate scales. How likely a default is (probability of default) and how bad it would be if it happened (loss given default, recovery) are two different things, so they get two different scales. The agencies do not fold "how much you'd lose" into the "how likely you are to lose it" letter grade, because mashing two incommensurable constructs into one ordinal destroys the meaning of both. A bond's default-likelihood rating and its recovery rating are reported independently and read independently.

They encode direction as a label, not a number. A rating "outlook" is positive, negative, or stable; a "watch" flags a likely near-term move. These are directional signs, not magnitudes. The agencies never write "outlook: +0.4." Direction is order and sign; it is not arithmetic.

And when they absolutely must produce a cardinal number, they map explicitly and non-linearly. To turn ordinal ratings into the probabilities a risk model needs, the standard tool is an ordered-probit or ordered-logit model, which, importantly, relaxes the assumption of an equally spaced scale by fitting endogenous "break points" between notches instead of treating the gaps as equal. Moody's Idealized Default Rates assign each notch its own empirically distinct default rate, varying by category and horizon. The translation from order to magnitude is allowed, but only through an explicit, validated, deliberately non-linear map. Nobody is permitted to assume the notches are evenly spaced, because they provably are not.
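To make the "break points" idea concrete, here is a minimal sketch of the ordered-logit link, assuming a single latent risk score and hand-picked cutpoints. In a real model the cutpoints are estimated from data; the values below are hypothetical and are not Moody's figures.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def ordered_logit_probs(latent: float, cutpoints: list[float]) -> list[float]:
    """Return P(category = k) for k = 0..len(cutpoints), lowest category first.

    Uses the cumulative form P(category <= k) = sigmoid(cutpoints[k] - latent).
    """
    if sorted(cutpoints) != list(cutpoints):
        raise ValueError("cutpoints must be in ascending order")
    cumulative = [sigmoid(c - latent) for c in cutpoints] + [1.0]
    probs, previous = [], 0.0
    for c in cumulative:
        probs.append(c - previous)  # mass between consecutive break points
        previous = c
    return probs


# Hypothetical break points; the gaps between them are deliberately unequal.
CUTPOINTS = [-3.0, -0.5, 0.2, 2.5]

probs = ordered_logit_probs(latent=0.0, cutpoints=CUTPOINTS)
print([round(p, 3) for p in probs])
```

Four cutpoints define five ordered categories. Nothing in the model forces the cutpoints to be evenly spaced, which is the property the paragraph above relies on.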

Four habits, one principle: the scale measures order; treat it that way, and cross to magnitudes only through a labeled door.

The domain that gets it wrong: the ML leaderboard

Now open any model leaderboard and watch every one of those guardrails get run over.

A leaderboard takes a handful of benchmarks (a coding suite, a math set, a reading-comprehension task, a safety eval) that measure genuinely incommensurable things on genuinely incommensurable scales, and it averages them into a single number. Then teams report movement on that number to one decimal place: "we gained 0.3 points." But an average of incommensurable benchmarks is a cardinal operation, addition and division, performed on a composite where the intervals are undefined. The number has the syntax of a measurement and the semantics of a vibe.

And here is the part that should genuinely alarm anyone who ships models against these scores. Multiple 2024–2026 analyses of LLM leaderboards have found that the standard deviation across ten runs of a single model often matches or exceeds the entire performance gap among the top-ten models. Read that again: the run-to-run noise of one model is frequently larger than the difference between the first-place and tenth-place models. The "+0.3 points" your team celebrated is, with depressing regularity, inside the error bars of the same model evaluated twice. The paper "When Benchmarks are Targets" (arXiv:2402.01781) documents how sensitive these rankings are to minute details; a 2025 result, "Dropping Just a Handful of Preferences Can Change Top Large Language Model Rankings" (arXiv:2508.11847), shows that removing a tiny slice of the human votes that drive a popularity-style leaderboard can flip who's on top. Researchers have also found that identical model weights can score ten to twenty percentage points apart depending only on which evaluation harness you run them through: the same model, scored on the same task, twenty points apart because of plumbing.
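A quick way to check whether a celebrated gap sits inside the noise is sketched below. The model names and run scores are invented. Note that this is accuracy on a single benchmark, where a standard deviation is a meaningful quantity, not a composite of several benchmarks.

```python
from statistics import mean, stdev

# Made-up accuracy scores (percent) from repeated runs of two hypothetical models.
runs = {
    "model_a": [71.2, 70.1, 72.4, 69.8, 71.9],
    "model_b": [70.9, 71.6, 69.7, 72.0, 70.3],
}


def gap_vs_noise(scores_a, scores_b):
    """Return (gap between mean scores, larger of the two run-to-run SDs)."""
    gap = abs(mean(scores_a) - mean(scores_b))
    noise = max(stdev(scores_a), stdev(scores_b))
    return gap, noise


gap, noise = gap_vs_noise(runs["model_a"], runs["model_b"])
verdict = "inside the noise" if gap <= noise else "larger than the noise"
print(f"gap = {gap:.2f}, run-to-run SD = {noise:.2f}: {verdict}")
```

Comparing the gap to one standard deviation is a crude screen chosen for brevity, not a formal test; the point is only that the comparison needs repeated runs before any delta is reported.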

So when you treat "ranked #3" as "twice as far from #1 as #2," you are not just committing Stevens' ordinal-as-interval error in theory. You are reading distance into a ranking whose order itself is unstable under resampling. A ranking that reorders when you drop a handful of examples was never a measurement; it was a snapshot of noise wearing a number's clothing.

The tell, the thing that makes this almost funny, is that the ML field has independently reinvented the fix without ever naming the cause. When researchers want a leaderboard that doesn't lie, they reach for rank-based aggregation: compute each method's rank on each dataset and average the ranks (or take the harmonic mean of ranks); report Spearman or Kendall rank correlation between benchmark versions; some even import "psychometric methodology" wholesale to fix leaderboards (arXiv:2501.17200). Every one of those is the ordinal-data toolkit Stevens prescribed in 1946 and the credit agencies have enforced for decades. The field is rediscovering measurement theory from first principles, one painful leaderboard scandal at a time, and it still hasn't picked up the eighty-year-old name for the disease.
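The rank-based toolkit is small enough to sketch in plain Python. The scores and model names below are invented, and both helpers assume there are no ties; this is an illustration of average-of-ranks and Kendall's tau, not any particular leaderboard's method.

```python
from itertools import combinations

# Made-up scores for three hypothetical models on three benchmarks (higher is better).
scores = {
    "coding": {"m1": 62.0, "m2": 58.5, "m3": 40.1},
    "math": {"m1": 30.2, "m2": 35.9, "m3": 28.4},
    "reading": {"m1": 88.0, "m2": 84.3, "m3": 86.1},
}


def ranks(benchmark_scores):
    """Map each model to its rank on one benchmark (1 = best; assumes no ties)."""
    ordered = sorted(benchmark_scores, key=benchmark_scores.get, reverse=True)
    return {model: position for position, model in enumerate(ordered, start=1)}


def mean_rank(all_scores):
    """Average each model's per-benchmark rank instead of averaging raw scores."""
    per_benchmark = [ranks(s) for s in all_scores.values()]
    models = per_benchmark[0].keys()
    return {m: sum(r[m] for r in per_benchmark) / len(per_benchmark) for m in models}


def kendall_tau(rank_x, rank_y):
    """Kendall's tau-a between two rankings of the same models (assumes no ties)."""
    concordant = discordant = 0
    for a, b in combinations(rank_x, 2):
        agreement = (rank_x[a] - rank_x[b]) * (rank_y[a] - rank_y[b])
        if agreement > 0:
            concordant += 1
        elif agreement < 0:
            discordant += 1
    n_pairs = len(rank_x) * (len(rank_x) - 1) // 2
    return (concordant - discordant) / n_pairs


aggregate = mean_rank(scores)
tau = kendall_tau(ranks(scores["coding"]), ranks(scores["math"]))
print(aggregate, round(tau, 3))
```

The raw scores never get added across benchmarks, so it does not matter that coding, math, and reading live on different scales; only each model's position within a benchmark is used.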

The honest caveat (so a statistician can't ambush you)

Now, the rule is not the dogma "you may never average ordinal data." If you state it that strongly, a good statistician will, correctly, push back, and you should know why before you write the confident version.

Stevens himself hedged: he admitted that taking the mean of ordinal data "will in many cases lead to fruitful results," and his prohibitions "have not been generally endorsed by statisticians." There's a real, defensible practice of treating composite scales as interval: sum twenty Likert items into one score and, by the central limit theorem, the sum behaves approximately like a continuous, roughly normal quantity you can average without much sin. And not every leaderboard is noise: the LiveBench team reports rank correlations above 0.997 between updates, meaning their rankings are genuinely stable, while older benchmarks show visibly weaker correlations (SQuAD around τ=0.93) under the same scrutiny. Stability is achievable; it just has to be demonstrated, not assumed.

So the honest, bulletproof version of the claim is not "never average." It is this: the burden of proof is on you to show the intervals are equal before you treat them as equal, and for star ratings, for averages of incommensurable benchmarks, and for rating notches, they provably are not. Aa1-minus-Aa2 is undefined not because averaging is forbidden but because someone checked, and the gaps are wildly unequal. The discipline is doing the check, not skipping the arithmetic.

The one question to ask before you average anything

Here is the portable version, the diagnostic you can run on any dashboard, scorecard, KPI, or leaderboard in about thirty seconds. Before you average, subtract, threshold, or rank-by-distance, ask the ratings question: is this scale cardinal (the intervals genuinely mean something) or ordinal (only the order does)?

If it's ordinal, the legal operations are rank-based only: median, percentiles and quartiles, Spearman/Kendall rank correlation, the sign of a change (up or down), and rank aggregation (average-of-ranks or harmonic-mean-of-ranks across tasks). The operations that quietly corrupt the signal are: the mean, the standard deviation, the numeric delta ("+0.3"), "twice as far as," and the equal-interval threshold ("ship at ≥ 4.0 stars").
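One way to report "up or down" without a numeric delta is an exact sign test over per-task outcomes. The sign test is my choice of illustration here, not something the sources above prescribe, and the outcomes are hypothetical.

```python
from math import comb


def sign_test(wins: int, losses: int) -> float:
    """Two-sided exact sign test p-value; drop ties before counting."""
    n = wins + losses
    if n == 0:
        return 1.0
    k = max(wins, losses)
    # Probability of a split at least this lopsided if each task were a coin flip.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical per-task outcomes: did the new model beat the old one on each task?
outcomes = ["up", "up", "down", "up", "up", "up", "down", "up", "up", "up"]
wins = outcomes.count("up")
losses = outcomes.count("down")
p_value = sign_test(wins, losses)
print(f"{wins} up, {losses} down, p = {p_value:.3f}")
```

The test uses only the direction of each comparison, so it stays valid even when the size of each difference has no defined unit.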

Then steal the credit world's three structural habits, because they translate directly into dashboard design:

Separate scales for separate constructs. Don't fold latency, accuracy, cost, and safety into one composite "score," any more than you'd fold recovery into default likelihood. Report them side by side; let the reader hold the trade-off.
Directional labels for direction. "Trending worse" is a sign, not a number. A red arrow that means "this regressed" is more honest than a fabricated "−0.2 health points."
An explicit, non-linear map when you must cardinalize. If a decision genuinely needs a number out of an ordinal scale, build the map deliberately and validate it: a 5-star rating is not linear in satisfaction, and the gap from 1★ to 2★ is not the gap from 4★ to 5★. Fit the break points; don't assume them.
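The first two habits can be sketched as a data structure. Every field name, unit, and tolerance below is a hypothetical choice for illustration; the point is only that the constructs stay in separate fields and direction is reported as a label.

```python
from dataclasses import dataclass
from enum import Enum


class Direction(Enum):
    IMPROVED = "improved"
    STABLE = "stable"
    REGRESSED = "regressed"


@dataclass(frozen=True)
class Scorecard:
    """Hypothetical dashboard row: one field per construct, no composite score."""
    accuracy_pct: float
    p95_latency_ms: float
    cost_usd_per_1k_requests: float
    safety_grade: str  # ordinal label, e.g. "A" > "B" > "C"


def direction(previous: float, current: float, tolerance: float,
              higher_is_better: bool = True) -> Direction:
    """Label the direction of a change; moves within `tolerance` count as stable."""
    change = current - previous
    if abs(change) <= tolerance:
        return Direction.STABLE
    got_better = (change > 0) == higher_is_better
    return Direction.IMPROVED if got_better else Direction.REGRESSED


last_week = Scorecard(80.0, 120.0, 0.42, "B")
this_week = Scorecard(80.2, 150.0, 0.42, "B")

print(direction(last_week.accuracy_pct, this_week.accuracy_pct, tolerance=0.5))
print(direction(last_week.p95_latency_ms, this_week.p95_latency_ms,
                tolerance=5.0, higher_is_better=False))
```

The tolerance should come from measured run-to-run noise for that metric rather than from a guess, so that "stable" means "indistinguishable from a rerun."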

And one rule the credit agencies don't even need because their scales are stable, but yours probably aren't: report the uncertainty. Put a confidence interval on each score and a stability check on the ranking. If your leaderboard reorders when you resample the test set or drop ten examples, you don't have a ranking yet; you have a draw that hasn't admitted it.
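A stability check of this kind can be sketched with a bootstrap over the test set. The per-example results below are invented, and the function name and defaults are assumptions made for this sketch.

```python
import random


def leader_stability(per_example, n_resamples=1000, seed=0):
    """Return (full-data leader, share of bootstrap resamples where it stays on top).

    per_example maps model name -> list of 0/1 correctness flags on a shared test set.
    """
    rng = random.Random(seed)
    models = list(per_example)
    n_examples = len(per_example[models[0]])

    def leader(indices):
        # Ties go to whichever model is listed first.
        return max(models, key=lambda m: sum(per_example[m][i] for i in indices))

    full_leader = leader(range(n_examples))
    kept = 0
    for _ in range(n_resamples):
        # Resample test examples with replacement, same size as the original set.
        sample = [rng.randrange(n_examples) for _ in range(n_examples)]
        if leader(sample) == full_leader:
            kept += 1
    return full_leader, kept / n_resamples


# Hypothetical results: model_a gets 52 of 100 right, model_b gets 50 of 100.
results = {
    "model_a": [1] * 52 + [0] * 48,
    "model_b": [0] * 50 + [1] * 50,
}
top, stability = leader_stability(results)
print(f"{top} stays on top in {stability:.0%} of resamples")
```

A leader that holds its place in nearly every resample has earned the rank; one that holds it little more than half the time is the "draw that hasn't admitted it."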

| Credit-rating discipline | The leaderboard does the opposite |
| --- | --- |
| States the scale is ordinal, "not cardinal" | Averages incommensurable benchmarks into one cardinal number |
| Separate scales for default-likelihood vs recovery | Folds coding, math, safety into one "score" |
| Direction is a label: positive / negative / stable | Direction is a fabricated number: "+0.3", "−0.2" |
| Order→magnitude only via a validated non-linear map | Treats every notch as evenly spaced by default |
| Legal ops: median, percentiles, rank correlation | Illegal ops: mean, SD, numeric delta, "twice as far" |

None of this is pedantry, and that's the whole point. The bond raters didn't build their separate-scales, directional-labels, non-linear-map discipline out of fussiness; they built it because trillions of dollars move on these grades and a single laundered ordinal, one "Aa1 minus Aa2" treated as a real subtraction, propagates into mispriced risk. Your stakes are smaller, but the failure mode is identical: a number that looks precise, sits in a dashboard, gets averaged into a quarterly KPI, and drives a real decision, while underneath it the "unit" it's denominated in does not exist. The difference between a signal and a number that merely looks like one is exactly the question Moody's makes you ask out loud: cardinal or ordinal? It costs you thirty seconds to ask it before you average. Ask it. The most disciplined people in finance already do, and they're embarrassingly far ahead of the leaderboards.

A score you can't subtract is a rank, and a rank needs a discipline.

If "Aa1 minus Aa2 is undefined" unsettles how you rank agents, that's the right reaction: an agent's standing is ordinal, earned from what it actually did, not a cardinal benchmark you can average and shop. The Agent Rating Protocol is reputation built the credit-rating way, a rank of relative track record, scale type stated, constructs kept separate, no fabricated "+0.3 health points."

pip install agent-rating-protocol · npm install agent-rating-protocol