A 100-point wine scale where nothing scores below 80. Credit ratings that couldn’t distinguish Treasury bonds from subprime mortgage pools. Performance reviews where everyone “meets expectations.” The same mechanism, in every domain, every time.
In the summer of 1978, Robert Parker Jr., a lawyer from Monkton, Maryland, published the first issue of The Wine Advocate from his basement. His innovation was a 100-point scale — borrowed from the American school grading system — that promised to do for wine what Consumer Reports had done for toasters: make quality legible to outsiders. No more mystifying French terminology. No more trusting the sommelier. Just a number.
By the early 2000s, the scale had conquered the wine industry. A Parker score could move prices overnight. But something had happened to the numbers. In practice, no wine Parker reviewed scored below 80. Most fell between 87 and 95. The 100-point scale had become an 8-point window doing all the economic work — setting auction prices, guiding distributor purchases, shaping what grapes got planted across three continents.
This isn’t a story about wine. It’s about what happens to every scoring system, in every domain, once the scores start to matter.
Parker’s scale was technically 50–100, with 50 representing an unacceptable wine. But unacceptable wines never made it into The Wine Advocate. Why would they? Parker chose which wines to review. By the time a number appeared in print, it had already passed a filter: someone thought this wine was worth evaluating.
That filter compressed the scale from both ends. From below, because wines unlikely to score well weren’t reviewed. From above, because a perfect 100 was reserved for transcendent experiences that came along a few times per decade. The effective range — the band where real differentiation happened — narrowed to roughly 85–98.
Within that band, the economic consequences were wildly asymmetric. A wine scoring 89 might retail in one price tier; the same wine scoring 91, a difference that should carry almost no information on a hundred-point scale, could jump 20–30% at auction. In the secondary market for Bordeaux and Napa cult wines, the gap between 95 and 96 meant thousands of dollars per case. Two points, on a scale where more than eighty of the hundred points went unused.
Parker wasn’t the only one. James Suckling’s scores compressed to a similar band. Jancis Robinson, who deliberately used a 20-point scale to resist this effect, found her effective range was about 14 to 19 — a 5-point window. The scale width didn’t matter. The compression was coming from somewhere deeper than the number of available points.
The wine industry’s compressed scores raised prices and lowered trust in critics. The credit rating industry’s compressed scores helped crash the global economy.
By 2006, Moody’s, Standard & Poor’s, and Fitch had assigned their highest rating — AAA, or Aaa in Moody’s notation, denoting negligible credit risk — to thousands of collateralized debt obligations built from subprime mortgages. These structured instruments shared a rating tier with U.S. Treasury bonds. The distance between “the full faith and credit of the United States government” and “a pool of adjustable-rate mortgages issued to borrowers with limited documentation” had been compressed to zero.
The mechanism was straightforward, even if the consequences weren’t. Under the issuer-pays model, the banks assembling the CDOs paid the agencies to rate them. An analyst who assigned a lower rating risked losing the client — the bank would take its business to a more agreeable agency. The social cost of a low score was revenue loss. The social cost of a high score was nothing — at least, nothing that arrived within the analyst’s review cycle.
The Financial Crisis Inquiry Commission documented the result in its 2011 report. More than 90% of AAA-rated mortgage-backed securities issued in 2006 and 2007 were eventually downgraded, many to junk status. The ratings hadn’t measured default probability. They had measured familiarity — how closely a new instrument resembled the structures that had been approved before.
The compression pointed in the direction of least resistance: upward, toward the client’s preferred outcome. Identical to the wine scale’s dynamics. Just with different consequences when the floor gave way.
Psychometrics has a term for part of this: central tendency bias — the tendency of raters to avoid extremes. But central tendency bias is a cognitive explanation for what is usually an incentive problem. The wine critic, the credit analyst, and the manager writing a performance review aren’t compressing their scales because their brains default to the middle. They’re compressing because the cost of differentiation is real and the cost of consensus is hidden.
Consider the incentive structure in each case.
The wine critic who gives a 78 to a prominent estate risks losing access to future tastings. The credit analyst who downgrades a client’s product risks losing the revenue. The manager who rates an employee “needs improvement” earns a difficult conversation, potential legal exposure, and a demoralized team member — even when the rating is accurate.
Now consider the cost of compression. The wine critic who gives an 89 instead of a 78 loses nothing. Not immediately. The reputational cost of grade inflation arrives slowly, diffused across the industry, shared by everyone. The credit analyst who gives an AAA loses nothing until the market corrects. The manager who gives everyone “meets expectations” avoids all the short-term costs and shares the long-term ones — attrition of high performers who feel invisible, retention of low performers who feel safe — with every other manager doing the same thing.
This is the structural insight: a scoring system eventually measures the cost of disagreement, not the quality of the thing being scored.
The compressed band isn’t noise. It’s signal — about the scorer, not the scored. And the direction of compression tells you where the social pressure points. Wine scores compress upward because generosity costs less than honesty. Performance reviews compress upward for the same reason. Academic grades compress upward — a phenomenon visible in reports that Harvard’s median grade has drifted toward A-minus. Online hotel reviews follow a J-shaped distribution, with most properties clustered between 4 and 5 stars out of 5, because guests who had a 2-star experience are less likely to review at all.
The shape of score compression is a map of social pressure. Read the map, and you know what force is acting on the scorer — even when the scorer doesn’t.
Not every scoring system collapses into a polite consensus band. The ones that resist compression share a design principle: they change what’s being measured rather than stretching the scale.
The most commercially successful example is the Net Promoter Score, introduced by Fred Reichheld in a 2003 Harvard Business Review article, “The One Number You Need to Grow.” Instead of asking customers to rate their satisfaction on a 10-point scale — which compressed, predictably, into a 7-to-9 band — NPS asks: How likely are you to recommend this product to a friend or colleague? The question is behavioral rather than evaluative. Recommending is a social act with reputational stakes. You’re not reporting how you feel; you’re predicting what you’d do. That shift in framing produces a distribution with actual spread.
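The cutoffs are where the anti-compression work happens, and they are easy to see in code. Here is a minimal sketch of the standard calculation, with responses on the 0-to-10 scale Reichheld specified (the function name is ours):

```python
def net_promoter_score(responses: list[int]) -> float:
    """Standard NPS: percent promoters (9-10) minus percent detractors (0-6).

    Scores of 7-8 are 'passives': counted in the denominator but in
    neither camp, which is what keeps a polite 7 from inflating the score.
    """
    if not responses:
        raise ValueError("need at least one response")
    promoters = sum(1 for r in responses if r >= 9)
    detractors = sum(1 for r in responses if r <= 6)
    return 100.0 * (promoters - detractors) / len(responses)

# A compressed "7-to-9 band" of answers yields a weak NPS:
# five 7s, three 8s, two 9s -> 20% promoters, 0% detractors -> NPS 20.
print(net_promoter_score([7, 7, 7, 7, 7, 8, 8, 8, 9, 9]))  # 20.0
```

The design detail that matters: a polite 7 or 8 counts for nothing, so only a 9 or 10, a recommendation the respondent would actually stake a relationship on, moves the number.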
Chess offers a more elegant structural solution. The Elo rating system, developed by Arpad Elo for the United States Chess Federation in the 1960s and adopted by FIDE in 1970, doesn’t ask anyone to rate anything. Every match updates both players’ ratings based on the outcome relative to prediction. No judge’s comfort zone matters. No one has a client to protect. The system measures revealed performance, not stated evaluation. The scale doesn’t compress because there’s no human assessor to compress it.
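The update rule itself is a few lines. A minimal sketch of the standard logistic Elo update (K = 32 is a common choice for club play; federations tune it by rating and game count):

```python
def elo_update(rating_a: float, rating_b: float,
               score_a: float, k: float = 32.0) -> tuple[float, float]:
    """One match, both ratings updated. score_a is 1 for a win,
    0.5 for a draw, 0 for a loss, from player A's perspective."""
    # Expected score for A: logistic curve, 400 points is roughly 10x odds.
    expected_a = 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))
    delta = k * (score_a - expected_a)
    # Zero-sum: what A gains, B loses. No rater anywhere in the loop.
    return rating_a + delta, rating_b - delta

# An upset moves ratings sharply; an expected win barely moves them.
print(elo_update(1400, 1800, score_a=1.0))  # underdog wins: ~29-point swing
print(elo_update(1800, 1400, score_a=1.0))  # favorite wins: ~3-point swing
```

Nothing in the function accepts an opinion; the only inputs are the prior ratings and who won.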
The pattern across these antidotes: you don’t fix a compressed scale by making it wider. You fix it by asking a question the scorer can’t compress without lying to themselves.
Three limits, ordered by how much they matter.
First, the stakes are not interchangeable. Credit rating compression contributed to a global financial crisis. Wine score compression shifts auction prices and planting decisions. Performance review compression causes individual career harm. The same mechanism at different scales produces wildly different consequences, and treating them as equivalent risks trivializing the catastrophic case.
Second, reversibility varies enormously. A startup can redesign its internal review rubric in a quarter. Redesigning credit rating methodology requires regulatory coordination across jurisdictions, shifts in business model, and decades of institutional inertia. The diagnosis travels well; the fix does not.
Third, not all compression is dysfunction. Some scoring systems should be stable. You don’t want credit ratings bouncing quarterly in response to noise — some smoothing is a feature, not a failure. The challenge is distinguishing stabilizing compression (dampening volatility) from consensus compression (dampening information). They look identical in the data until the moment they don’t.
If you manage a team, run a review process, or rely on scores to make decisions, here’s the diagnostic.
Calculate your effective range. Take the highest and lowest scores across your last twenty or so evaluations. If the effective range is 20% of the theoretical range or less — say, your 1-to-10 scale actually produces scores between 7 and 9 — your system is compressed.
Identify the pressure direction. Scores compressing upward means the social cost of going low exceeds the cost of going high. Scores compressing toward the center means the cost of being the outlier in either direction is high. The direction tells you which force to address.
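Both checks are mechanical enough to script. A minimal sketch, assuming your scores are collected as a list of numbers; the function name and the median-versus-midpoint test for direction are our illustration, not a standard diagnostic:

```python
from statistics import median

def diagnose(scores: list[float], scale_min: float, scale_max: float) -> dict:
    """Effective-range and pressure-direction check for a scoring system."""
    theoretical = scale_max - scale_min
    effective = max(scores) - min(scores)
    ratio = effective / theoretical
    # Where does the mass sit relative to the scale's midpoint?
    midpoint = (scale_min + scale_max) / 2
    if median(scores) > midpoint:
        direction = "upward"      # going low costs more than going high
    elif median(scores) < midpoint:
        direction = "downward"
    else:
        direction = "central"     # being the outlier in either direction is costly
    return {
        "effective_range": effective,
        "range_ratio": round(ratio, 2),
        "compressed": ratio <= 0.20,  # the essay's 20% rule of thumb
        "direction": direction,
    }

# A 1-to-10 scale that only ever produces 7s and 8s: compressed, upward.
print(diagnose([7, 8, 8, 8, 7, 8, 8, 7, 8, 8], 1, 10))
```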
Change the question before you change the scale. Calibration training helps, but temporarily. Expanding from 5 points to 10 gives people more unused numbers to avoid — nothing else. Instead, replace evaluative questions (“How good is this?”) with behavioral ones (“Would you ship this to a customer today?”) or comparative ones (“Rank these three deliverables from strongest to weakest”). Questions that require differentiation produce differentiation.
Make rubric criteria resist compression. “Quality of work” is compressible because quality is subjective. “Number of production incidents introduced” is not. “Exceeded expectations” compresses toward the generous; “shipped on the committed date” does not. The more a criterion describes an observable event rather than an assessed impression, the harder it is to compress.
And if you find yourself defending a scoring system on the grounds that it’s precise — that the numbers are consistent, reproducible, carefully calibrated — ask one question first: precise about what? A thermometer that reads 72°F in every room of a building is precise. It’s also broken.
Robert Parker retired from The Wine Advocate in 2019, forty-one years after he published that first issue from his basement. The 100-point scale survived him. It still runs the secondary wine market. And the scores still cluster in that narrow band — a few points doing the work of a hundred.
The scale works. It just doesn’t measure what its creator intended. It measures something more honest: how much a critic is willing to stake on the claim that this bottle is different from that one.
That gap — between what a score promises to measure and what it actually measures — is worth understanding. Not just in wine. In every system where someone writes down a number and someone else makes a decision because of it.
The number is always real. The question is what it’s a number of.
Scores compress because evaluative questions invite compression. What if the question were structural instead?
Agent Rating Protocol takes the essay’s own prescription and applies it to agent trust. Instead of asking “How trustworthy is this agent?” — an evaluative question that compresses into a polite consensus band — ARP anchors every claim to a signed record. Each record names the specific judgment applied, the evidence behind it, and the downstream decisions that inherit from it. No assessor comfort zone. No compression band. Just verifiable performance — the Elo approach, applied to agent decisions.
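For concreteness, here is what such a record could look like. This is a hypothetical sketch: the field names and the HMAC signing scheme below are our illustration, not the actual schema or API of the agent-rating-protocol package:

```python
from dataclasses import dataclass
import hashlib, hmac, json

@dataclass(frozen=True)
class SignedRecord:
    """Hypothetical ARP-style record; not the package's real schema."""
    claim: str            # the specific judgment being asserted
    evidence: list[str]   # hashes or URIs of the supporting material
    parents: list[str]    # earlier records whose decisions this one inherits
    signature: str = ""   # filled in by sign()

def _body(r: SignedRecord) -> bytes:
    # Canonical serialization: everything except the signature itself.
    return json.dumps({"claim": r.claim, "evidence": r.evidence,
                       "parents": r.parents}, sort_keys=True).encode()

def sign(r: SignedRecord, key: bytes) -> SignedRecord:
    sig = hmac.new(key, _body(r), hashlib.sha256).hexdigest()
    return SignedRecord(r.claim, r.evidence, r.parents, sig)

def verify(r: SignedRecord, key: bytes) -> bool:
    expected = hmac.new(key, _body(r), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, r.signature)
```

Whatever the real schema looks like, the shape makes the essay's point: there is no 1-to-10 field anywhere. A verifier checks the chain of evidence and inheritance, not anyone's comfort zone.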
Verify an agent’s decision chain · Follow a claim through its evidence · pip install agent-rating-protocol