The Issuer-Pays Conflict Is Hiding in Your Benchmark Leaderboard

The rated party pays the rater and can shop for a better grade. That arrangement stamped AAA on garbage in 2006, and it sets your AI benchmark scores today.

Published June 2026 · 9 min read

In 2010, the Financial Crisis Inquiry Commission put Gary Witt, a former managing director at Moody's, under oath and asked him a simple question: did the investment banks whose bonds you rated ever threaten to take their business elsewhere if they didn't get the grade they wanted? His answer is one of the great unguarded moments in the record of the financial crisis.

"Oh God, are you kidding? All the time. I mean, that's routine. I mean, they would threaten you all of the time... It's like, 'Well, next time, we're just going to go with Fitch and S&P.'"

Sit with the structure of that sentence, because it is the whole essay. The party being rated was paying the rater, could shop among raters, and routinely threatened to walk unless the grade improved. That arrangement has a name, the issuer-pays model, and it produced AAA stamps on mortgage securities that turned out to be garbage. It is also, almost exactly, the arrangement under which your favorite AI model got its benchmark scores, your vendors got their SOC 2 reports, and the apps on your phone got their store rankings. Once you learn to see it, you cannot unsee it, and the design question for every trust score you rely on collapses to six words: who pays, and can they shop?

Who pays the rater?

It was not always this way for credit ratings. Through the middle of the twentieth century, the agencies ran on a subscriber-pays model: investors bought the ratings, the way you buy a newspaper, and the agency's customer was the person relying on the grade. That alignment is the healthy one — the rater serves the reader.

What broke it was, of all things, the photocopier. By the early 1970s, cheap Xerox machines made it impossible to keep a rating inside the circle of paying subscribers; one buyer could copy the report and pass it around for free. The free-rider problem hollowed out the subscriber model, and Moody's, S&P, and Fitch all flipped to issuer-pays: the company issuing the bond would pay the agency to rate it. A mundane technology shift — copying got cheap — quietly installed a structural conflict of interest that would help crater the global economy thirty-eight years later. Keep that causal chain in mind; it has a direct modern echo.

The conflict is easy to state and hard to escape. When the issuer pays and can choose among three agencies, each agency knows that a rating the issuer dislikes sends the fee to a competitor. As the plainest summary of the mechanism puts it, an agency "might shade its rating upward so as to keep the issuer happy and forestall the issuer's taking its business to a different rating agency." Nobody has to be corrupt. The incentive does the work on its own, one reasonable-seeming decision at a time.

Ninety percent of AAA became junk

Then the structured-finance boom turned the incentive into a firehose. Rating a plain corporate bond was modest work for a modest fee. Rating a collateralized debt obligation — a tower of sliced and repackaged mortgages — paid roughly three times as much, with structuring fees running $300,000 to $500,000 and as high as $1 million per vehicle. The agencies were not bystanders to the mortgage machine; they were paid handsomely to bless it, by the very banks assembling it, who could shop the deal to whichever agency was most accommodating.

The output was a flood of top grades. More than half of all the structured-finance securities Moody's rated carried a AAA, the same grade reserved for the safest sovereigns and bluest blue chips. Seventy-one percent of CDO issuance was rated AAA. And then reality arrived. Over 90% of the AAA-rated mortgage-backed securities from 2006 and 2007 were downgraded to junk by 2008. In the first three quarters of 2008 alone, 11,327 downgrade actions, about 31% of all downgrades in that period, hit tranches that had been stamped AAA. The Commission's verdict was blunt: the agencies were "essential cogs in the wheel of financial destruction" and "key enablers of the financial meltdown." A AAA from a shopped, issuer-paid agency in 2006 carried, in the end, almost no information. It was a number everyone could hit, which is the same as a number that means nothing.

It's worth being honest about the obvious fix, because it's only half a fix. Surely investor-pays is cleaner? The evidence is genuinely mixed. Egan-Jones, an investor-paid agency, has been compared head-to-head with the issuer-paid giants, and it does behave better in the way that matters most for safety: it issues more conservative ratings to low-quality firms, and its downgrades move stock prices more sharply, suggesting the market reads them as carrying real information. But investor-paid agencies have their own biases — toward the preferences of the institutional investors who fund them. The conflict doesn't vanish when you change who pays; it rotates. "Please the issuer" becomes "please the big investor." The honest framing isn't "find the model with no conflict." It's "know which way your rater is bent, and decide whether that bias is the less dangerous one for your purpose."

Your leaderboard has the same structure

Now look at the scoreboard that ranks AI models, and run the same six-word test.

Who pays the rater? The labs being scored on the benchmarks are not passive subjects. They help design the benchmarks — MMLU was co-authored with researchers who sit inside the industry; GPQA and others involve lab-affiliated authors. They train on benchmark-adjacent data, sometimes the benchmarks themselves, so the model recalls answers rather than reasons to them. They select which benchmarks to report, publishing the flattering numbers and quietly dropping the rest. And they optimize directly for the scores, which is Goodhart's Law stated as a product roadmap: when a measure becomes a target, it stops being a good measure. No one has to cheat. The incentive does the work — exactly as it did at Moody's.
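One way to make the contamination worry concrete is a word-level n-gram overlap check between benchmark items and a sample of training text. What follows is a minimal sketch, not any lab's actual decontamination pipeline: the function names, the example strings, and the window size `n` are all assumptions chosen for illustration, and a real check would need proper tokenization and a far larger corpus.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of word-level n-grams in text (lowercased, whitespace-split)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def contamination_rate(benchmark_items: list[str], corpus_docs: list[str], n: int = 8) -> float:
    """Fraction of benchmark items that share at least one n-gram with the corpus."""
    corpus_grams: set[tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    if not benchmark_items:
        return 0.0
    flagged = sum(1 for item in benchmark_items if ngrams(item, n) & corpus_grams)
    return flagged / len(benchmark_items)


# Made-up example data: the first item appears verbatim inside a scraped "quiz dump".
benchmark_items = [
    "what is the capital of france and why",
    "name the largest moon of saturn",
]
corpus_docs = ["quiz dump: what is the capital of france and why does it matter"]

rate = contamination_rate(benchmark_items, corpus_docs, n=5)
print(f"items overlapping the corpus: {rate:.0%}")
```

A nonzero rate does not prove the model memorized anything, but it does tell you the exam was available to study from.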

The "shopping" looks a little different but rhymes precisely. A bank threatened, "next time we'll go with Fitch and S&P." A lab doesn't need to threaten anyone; it simply promotes the benchmark on which it leads and ignores the ones where it trails. The competitive pressure to lead some leaderboard selects, across the whole field, for benchmarks that can be led — which is to say, benchmarks that can be gamed.
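The arithmetic of selective reporting is easy to simulate. The sketch below assumes a model with one fixed "true" skill whose measured score on each benchmark is that skill plus random noise; the skill level, noise size, benchmark count, and function names are all invented for illustration. Reporting only the most flattering few results then beats the full average even though nothing about the model changed.

```python
import random
from statistics import mean


def simulate_scores(true_skill: float, n_benchmarks: int, noise: float,
                    rng: random.Random) -> list[float]:
    """Draw one noisy score per benchmark around the same underlying skill."""
    return [true_skill + rng.gauss(0.0, noise) for _ in range(n_benchmarks)]


def reported_vs_full(scores: list[float], k: int) -> tuple[float, float]:
    """Return (mean of the k most flattering scores, mean of all scores)."""
    top_k = sorted(scores, reverse=True)[:k]
    return mean(top_k), mean(scores)


rng = random.Random(0)  # fixed seed so the run is repeatable
scores = simulate_scores(true_skill=70.0, n_benchmarks=12, noise=3.0, rng=rng)
reported, full = reported_vs_full(scores, k=3)
print(f"best 3 of 12: {reported:.1f}   all 12: {full:.1f}")
```

The gap between the two numbers is pure selection. It grows with the number of benchmarks available to choose from, which is the leaderboard version of having three agencies to call.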

And the tell is the same tell. In credit, the warning sign was grade inflation until AAA stopped discriminating. In benchmarks, it's saturation. GPQA Diamond — a deliberately hard graduate-level science benchmark — ran from about 39% for frontier models in late 2023 to 94% and up by 2025–2026. When every serious model clusters against the ceiling, the benchmark has the same information content a AAA rating had in 2006: nearly none. Contamination accelerates it; older benchmarks like HumanEval, MMLU, and GSM8K are especially compromised because their questions have leaked into training corpora. If 90% of "state-of-the-art" scores are inflated by contamination and optimization, then "SOTA on benchmark X" deserves precisely the trust you'd extend to a 2006 AAA.
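Saturation can be checked with two subtractions: how much room is left above the best score, and how far apart the serious contenders are. This is a rough sketch under stated assumptions. The thresholds are arbitrary, and apart from the 39% and 94% figures quoted above, the scores are made up to illustrate the shape of the problem.

```python
def headroom_and_spread(scores: list[float], ceiling: float = 100.0) -> tuple[float, float]:
    """Return (room left above the best score, gap between best and worst score)."""
    best, worst = max(scores), min(scores)
    return ceiling - best, best - worst


def is_saturated(scores: list[float], ceiling: float = 100.0,
                 min_headroom: float = 10.0, min_spread: float = 5.0) -> bool:
    """Flag a benchmark whose top scores are both near the ceiling and bunched together.

    The two thresholds are illustrative assumptions, not an established standard.
    """
    headroom, spread = headroom_and_spread(scores, ceiling)
    return headroom < min_headroom and spread < min_spread


early_field = [39.0, 33.0, 28.0]        # hypothetical field around the 39% starting point
late_field = [94.0, 93.1, 92.5, 91.8]   # hypothetical field clustered near 94%

print("early field saturated:", is_saturated(early_field))
print("late field saturated:", is_saturated(late_field))
```

When the second check comes back true, the ranking among those models is mostly noise, whatever the leaderboard order says.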

| Credit ratings | Benchmark leaderboards |
| --- | --- |
| Issuer pays the agency | Lab designs and trains on the benchmark |
| Shop for the agency with the best grade | Report only the favorable benchmarks |
| "We'll go to Fitch and S&P" | "We'll promote a different benchmark" |
| 90% of AAA tranches downgraded to junk | Scores saturate; stop discriminating |
| 3× revenue per structured product | Benchmark wins drive enterprise adoption |
| Independent check: investor-paid agencies | Independent check: community evals (HELM, lm-eval) |

It's everywhere once you look

The pattern is not special to ratings and benchmarks. It is the default failure mode of any trust score, and it hides in places you've been treating as settled.

SOC 2 audits. The company being audited pays the auditor and chooses which firm to hire. That is issuer-pays with a compliance logo on it. Companies can and do gravitate toward auditors known to be smooth, and a SOC 2 report is, structurally, a trust rating funded by the party it reassures you about.

App store rankings. Developers pour money into paid installs, review manipulation, and keyword stuffing to move their position — the rated party investing directly in inflating the rating. And the storefront playing "agency" earns its cut from those same developers, which is not a neutral seat.

Vendor-run bug bounties. When a company triages its own bounty program, the entity that would be embarrassed by a serious vulnerability is the same entity deciding which reports are "valid," what severity they get, and whether they're disclosed. The rated party owns the rating. Researchers have war stories about exactly this — high-impact findings quietly downgraded to keep the score, and the payout, low.

In every case the question that cuts through is identical: who funds the scorer, and can the scored party shop or self-select? When the answers are "the scored party" and "yes," discount the score accordingly — not because anyone is necessarily acting in bad faith, but because you would be betting against the incentive, and the incentive is patient.
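The two questions are simple enough to write down as a checklist. The sketch below is only that: the class, field names, and verdict wording are invented here, and the three example entries encode this essay's own reading of each case rather than any measured fact.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrustScore:
    name: str
    scorer_paid_by_scored: bool  # who pays the scorer?
    scored_can_shop: bool        # can the scored party shop or self-select?


def verdict(score: TrustScore) -> str:
    """Apply the two-question test and return a rough level of discount."""
    if score.scorer_paid_by_scored and score.scored_can_shop:
        return "discount heavily; find an independent check"
    if score.scorer_paid_by_scored or score.scored_can_shop:
        return "discount somewhat; note which way the bias bends"
    return "structurally independent; still ask where the bias rotated to"


examples = [
    TrustScore("2006 AAA from an issuer-paid agency", True, True),
    TrustScore("investor-paid credit rating", False, False),
    TrustScore("self-reported benchmark result", False, True),
]
for example in examples:
    print(f"{example.name}: {verdict(example)}")
```

Note that even the cleanest case gets a caveat rather than a pass, for the reason the Egan-Jones comparison made clear: bias rotates rather than disappears.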

The fix is boring, and it's old

The good news is that nobody has to invent the remedy. Finance already wrote it down, even if it never fully adopted it. The structural cure for rating-shopping is to break the link between the rated party and its choice of rater. After 2008, Dodd-Frank contemplated a clearinghouse or assignment model: a neutral body assigns a rating agency to an issuer, so the issuer can't shop and the agency isn't auditioning for the next fee. It was only partially implemented — the politics were hard, because the people who benefit from a broken system rarely volunteer to fix it — but the design is sound and the principle is portable.

Port it. For benchmarks, the equivalent of the clearinghouse is a neutral party that decides what the model is tested on, rather than the lab choosing its own exam — and, critically, a held-out test set the labs cannot train on or even see. Community-run evaluation efforts like HELM and the open lm-eval-harness move in this direction precisely because they aren't funded or curated by the labs being ranked; treat them as your investor-paid agency, the independent check on the issuer-paid score. For audits, the lever is mandatory rotation and, better, an assignment body that picks your auditor for you. For bug bounties, it's third-party triage — a platform whose paycheck doesn't depend on the vendor's vulnerability count looking small.
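The assignment idea can be sketched in a few lines: derive the evaluator from a hash of the model's identifier and a public seed, so that neither the lab nor the evaluator gets a say. This is an illustration of the principle and not a description of how any existing body works. The function name, the identifiers, and the evaluator labels are hypothetical, and the scheme only resists shopping if the seed is published after the model identifier is registered.

```python
import hashlib


def assign_evaluator(model_id: str, round_seed: str, evaluators: list[str]) -> str:
    """Pick an evaluator deterministically from a hash of the seed and the model id.

    Assumption: round_seed is made public only after model_id is fixed, so the
    rated party cannot try identifiers until it lands on a friendly evaluator.
    """
    if not evaluators:
        raise ValueError("need at least one evaluator")
    digest = hashlib.sha256(f"{round_seed}:{model_id}".encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(evaluators)
    # Sort first so the result does not depend on the order the list was passed in.
    return sorted(evaluators)[index]


evaluators = ["evaluator-a", "evaluator-b", "evaluator-c"]  # hypothetical names
assigned = assign_evaluator("model-x", "2026-round-1", evaluators)
print("model-x is evaluated by:", assigned)
```

Because anyone can recompute the hash, the assignment is auditable after the fact, which matters as much as the randomness: the point is that no phone call can change the answer.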

None of this is glamorous, and none of it fully eliminates bias, because — remember the investor-pays lesson — bias rotates rather than disappears. The realistic goal is not a conflict-free score. It is to move the conflict somewhere less dangerous to you specifically, and to know which way the remaining bias bends.

Who pays, and can they shop?

So here is the practical habit, and it costs nothing but attention. The next time someone hands you a trust score — a benchmark ranking, a SOC 2 attestation, a five-star app, a "no critical vulnerabilities" report — don't start by arguing about the number. Start by asking the two questions Gary Witt's testimony answered by accident. Who pays the scorer? And can the scored party shop for a better score or pick which score to show you? If the answers are "the party being scored" and "yes," you are looking at a 2006 AAA, and you should reach for the independent check — the investor-paid agency, the community eval, the third-party auditor — before you reach for your wallet or your trust.

The agencies didn't fail in 2008 because their analysts were stupid. They failed because the structure paid them to shade upward, and a competing agency was always one phone call away. That structure has quietly propagated into how we measure software, vendors, and now intelligence itself. The cure isn't smarter raters. It's changing who signs their checks, and until that changes, the most useful thing you can do with any leaderboard is to ask who paid to be on top of it.

A reputation the rated party can't pay for, shop, or self-select.

Every fix in this essay is the same move: make the scorer independent of the scored. For agents, that means reputation that isn't a self-reported benchmark or a vendor's own claim — it's earned from what an agent actually did, attested by the counterparties it did it with, and portable across the places that rely on it. The Agent Rating Protocol is that independent check: a who-pays-can't-shop reputation layer, so an agent's track record is the investor-paid agency, not the 2006 AAA.

pip install agent-rating-protocol · npm install agent-rating-protocol
vibeagentmaking.com
