In April 2026, ten different US courts sanctioned ten different lawyers for filing AI-generated briefs with fabricated citations. The cumulative penalty across those ten cases was $43,515 — and that was the small end (Damien Charlotin’s running database of AI hallucination cases, fetched 2026-05-01). Earlier the same month, a Nebraska judge handed Greg Lake an indefinite license suspension — the first US license suspension tied to AI hallucinations — after 57 of 63 citations in his filings were flagged, including 20 confirmed hallucinations and 3 entirely fabricated cases (ComplianceHub.wiki Q1 2026 review). An Oregon federal court issued a record $110,000 sanction in Valley View Winery on April 4, 2026, for 23 fabricated citations and 8 false quotes, then noted with apparent astonishment that the attorney had quietly deleted the errors and resubmitted without disclosure (same source).

What the cases share is not lawyer incompetence. They share a model behavior. None of the AI tools — including premium products like LexisNexis Lexis+, Thomson Reuters CoCounsel, and vLex — flagged their own uncertainty. Every fabricated citation arrived with the same formatting, the same authoritative tone, the same confidence as a real one. The lawyers trusted the output because the output gave them no reason not to.

That is not a quirk of legal AI. It is a tax. And almost everyone who deploys an LLM is paying it.

The Ladder, and Why “Cheap” Means “Confidently Wrong”

In 2026, two researchers at Cognizant — Ghosh and Panday — published the most systematic study of LLM confidence calibration to date (arXiv:2603.09985). They ran 24,000 trials across four frontier models on MMLU, ARC, HellaSwag, and TriviaQA. For each question, the model was asked to answer and then rate its own confidence on a 0-to-100 scale.

The results are not a curve. They are a cliff.

Model               Accuracy   Mean confidence   Expected Calibration Error
Claude Haiku 4.5    75.4%      86.0%             0.122 (best)
Gemini 2.5 Pro      80.9%      99.5%             0.185
Gemini 2.5 Flash    70.9%      97.9%             0.272
Kimi K2             23.3%      95.7%             0.726 (worst)

Kimi K2 — the budget-tier model from Chinese frontier lab Moonshot AI — got the answer right 23.3% of the time and reported 95.7% average confidence. On TriviaQA the spread was 3.9% accuracy at 97.9% mean confidence, an Expected Calibration Error of 0.940. The model expressed near-total certainty about answers it got wrong 96 times out of 100.

Claude Haiku 4.5 closed the confidence-accuracy gap to about ten points and posted the only example of underconfidence in the entire study: on HellaSwag it averaged 74.0% confidence against 82.9% accuracy. It was more accurate than it claimed to be.
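Expected Calibration Error is worth being precise about, because it is the number the whole argument turns on. The standard recipe bins answers by the confidence the model stated and averages the gap between each bin's mean confidence and its actual accuracy. The sketch below uses the common ten-bin version; that binning choice is an assumption about the protocol, not a detail the study spells out here.

# Ten-bin Expected Calibration Error: bucket answers by stated confidence,
# then take the sample-weighted average gap between each bucket's mean
# confidence and its actual accuracy. (Binning details are an assumption.)
def expected_calibration_error(confidences, correct, n_bins=10):
    """confidences: floats in [0, 1]; correct: 0/1 flags of the same length."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        bucket = [i for i, c in enumerate(confidences)
                  if (c > lo or b == 0) and c <= hi]
        if not bucket:
            continue
        avg_conf = sum(confidences[i] for i in bucket) / len(bucket)
        accuracy = sum(correct[i] for i in bucket) / len(bucket)
        ece += (len(bucket) / n) * abs(avg_conf - accuracy)
    return ece

# A model that is right 23% of the time while claiming ~96% confidence lands
# near the top of the scale, roughly where Kimi K2 sits in the table above.
print(expected_calibration_error([0.96] * 100, [1] * 23 + [0] * 77))  # ~0.73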

The pattern repeats outside this study. Qazi, Khan, Ghani and colleagues published “Large language models show Dunning-Kruger-like effects in multilingual fact-checking” in Nature Scientific Reports (2026, vol. 16, article 7594; arXiv:2509.08803). They tested nine models on 5,000 claims previously assessed by 174 fact-checking organizations across 47 languages. Llama-7B answered with up to 88% certainty at roughly 60% selective accuracy. GPT-4o reached 89% selective accuracy with certainty under 40%. The smaller, cheaper models said more — and said it more confidently — about the things they got wrong. The bigger models said less, and said it more carefully, about the things they got right.

It is the structural mirror of Kruger and Dunning’s 1999 finding in humans: bottom-quartile performers on logical-reasoning tests rated their performance in the 62nd percentile, while top-quartile performers underestimated themselves by 13 to 15 points. The metacognition has not changed. It just runs on GPUs now.

Why a Confident Error Costs More Than an Uncertain One

If overconfidence only changed how often the model was wrong, you could budget for the error rate and route around it. The problem is that the cost of an error depends on how confidently it is delivered, and that cost is asymmetric.

A model that says “I’m not sure, but...” triggers human review. The error gets caught, regenerated, corrected. That costs minutes.

A model that says “The answer is X” with no hedge — Kimi K2’s default mode — slides past review. It enters the system. It gets quoted in the brief, baked into a pricing decision, fed forward into the next agent. By the time it surfaces, the unwinding takes hours, days, or, in the legal cases, six-figure sanctions.

Three independent lines of evidence make this concrete. SQ Magazine’s 2026 hallucination survey reports that 62% of users trust AI outputs without verification on first contact. MIT research from earlier the same year found that LLMs use confident language (“definitely,” “without doubt”) 34% more often when generating incorrect information than when generating correct information. And the AA-Omniscience benchmark (November 2025) tested 40 models on knowledge tasks: 36 of the 40 — exactly nine in ten — were more likely to give a confident wrong answer than to say “I don’t know.”

Stack those findings: the user is calibrated to trust confidence; the model produces more of it precisely when wrong; nine in ten models prefer confident wrongness to honest uncertainty. Confidence and incorrectness are positively correlated. The signal does not just fail to track accuracy — it actively pulls the user toward the wrong answer.

Why Alignment Made It Worse

The intuitive fix is to buy a bigger model and trust that scale and training will calibrate it. That assumption is wrong twice over.

First, Tan and colleagues (BaseCal, January 2026) showed that base language models — before any helpful-assistant fine-tuning — are reasonably well calibrated. They learn probability distributions through maximum-likelihood estimation, and their token probabilities mostly track frequency in the training corpus. Then alignment training happens. RLHF, DPO, and PPO reward outputs that human raters labeled as helpful and authoritative. Those outputs systematically express more confidence than the base model would have. The pipeline that makes models pleasant to talk to is the pipeline that destroys their calibration.

Second, scale alone does not save you. A 2026 study published in Nature npj Gut and Liver tested 48 LLMs across 8 model families on 300 gastroenterology board-exam questions. Only 5 of 48 — about 10% — produced calibration meaningfully better than random. Even o1-preview, the best performer, scored an AUROC of roughly 0.6 — a hair above the 0.5 baseline. Across model generations, accuracy improved; calibration did not. The study’s conclusion is blunt: “LLMs cannot be relied upon to communicate uncertainty, and human oversight remains essential for safe use.”
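If AUROC is unfamiliar, in this setting it answers one narrow question: pick an answer the model got right and one it got wrong, and ask how often the right one carried the higher confidence. At 0.5 the confidence is pure noise; 0.6 is barely better. A minimal pair-counting version, illustrative rather than the study's code:

# AUROC for confidence: how often does a correct answer carry higher stated
# confidence than an incorrect one? 0.5 is a coin flip; ~0.6 is barely better.
def confidence_auroc(confidences, correct):
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        return float("nan")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))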

Human oversight is the tax. It is the cost of compensating for a model that will not tell you when it does not know.

The TCO Blind Spot

Pricing pages do not show calibration error. They show dollars per million tokens. As of May 2026, that range stretches from Xiaomi MiMo-V2-Flash at $0.09 per million input tokens on the cheap end to OpenAI GPT-5.2-Pro at $21.00 at the top — a 233x range on input tokens, and a 579x range on output (CostGoat LLM API pricing comparison, May 2026). For an organization processing a billion tokens a month, the gap is $90 versus $21,000.

Standard total-cost-of-ownership analyses fold in infrastructure overhead. Industry analyses estimate enterprise LLMOps multipliers of 2.3x to 4.1x raw API spend (theneildave.in, 2026): guardrails consume 10–30% in token overhead, retries add 5–15%, prompt maintenance and evaluation infrastructure tack on quarterly costs. Sixty-eight percent of teams underestimate their first-year LLM bill by more than three times.

But that math counts tokens, not confidence. The hidden line item is what I’ll call the Dunning-Kruger Tax:

DK_Tax ≈ (ECE_cheap − ECE_premium) × N_queries × C_correction

where ECE is the Expected Calibration Error of each model, N is the number of queries processed, and C_correction is the average downstream cost per uncaught confident error: review time, rework, refunded contracts, sanctions, suspended licenses. Plug Ghosh and Panday's data into a million-query workflow at a modest $10 average correction cost. The calibration gap between Kimi K2 (0.726) and Claude Haiku 4.5 (0.122) is 0.604; taken literally, the formula prices that gap at roughly $6 million per million queries. Even on the conservative assumption that only one uncaught confident error in a hundred ever triggers a paid correction, the figure is about $60,400, which can match or exceed the API savings the cheaper model produced.
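Here is the same arithmetic as a sketch you can argue with. The formula and the ECE numbers come from above; the one-in-a-hundred correction-trigger rate is an illustrative assumption, not a measured quantity.

# DK_Tax ≈ (ECE_cheap − ECE_premium) × N_queries × C_correction, with an
# optional factor for what fraction of uncaught confident errors ever turns
# into a paid correction. Replace that fraction with your own incident data.
def dk_tax(ece_cheap, ece_premium, n_queries, cost_per_correction,
           correction_trigger_rate=1.0):
    gap = ece_cheap - ece_premium
    return gap * n_queries * cost_per_correction * correction_trigger_rate

literal = dk_tax(0.726, 0.122, 1_000_000, 10.0)             # ≈ $6.0M, formula at face value
conservative = dk_tax(0.726, 0.122, 1_000_000, 10.0, 0.01)  # ≈ $60,400, 1-in-100 errors corrected
print(f"literal: ${literal:,.0f}   conservative: ${conservative:,.0f}")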

The COREA paper (arXiv:2603.03752, March 2026) tests this in a controlled setting. The researchers compared calibrated versus uncalibrated routing of queries between small and large LLMs. Calibrated routing produced 16.8–21.5% cost savings with only 1.5–1.7 percentage points of accuracy loss — a clean win. Uncalibrated routing, using the same small model without confidence calibration, saved 82% of token cost — and lost 11.4 percentage points of accuracy. That accuracy loss is the DK Tax measured under laboratory conditions: the “savings” generated more errors than the tokens were worth.
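The mechanism being tested is simple enough to sketch: answer with the cheap model only when a calibrated version of its confidence clears a threshold, and escalate otherwise. Every name, the 0.85 threshold, and the calibration hook below are stand-ins for your own components, not the paper's implementation.

from dataclasses import dataclass
from typing import Callable

@dataclass
class RoutedAnswer:
    text: str
    model: str
    confidence: float

# Confidence-gated routing: the cheap model answers only when its *calibrated*
# confidence clears the bar. `calibrate` is whatever post-hoc mapping you fit
# offline; the threshold is a knob tuned against your own error budget.
def route(query: str,
          cheap_model: Callable[[str], tuple[str, float]],
          strong_model: Callable[[str], str],
          calibrate: Callable[[float], float],
          threshold: float = 0.85) -> RoutedAnswer:
    draft, raw_conf = cheap_model(query)
    conf = calibrate(raw_conf)        # never gate on the raw verbalized score
    if conf >= threshold:
        return RoutedAnswer(draft, "cheap", conf)
    return RoutedAnswer(strong_model(query), "strong", conf)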

Where the Analogy Breaks

The Dunning-Kruger parallel is useful, but it is borrowed, not derived. Three places it strains.

The original DK effect itself is contested. Gignac and Zajenkowski (2020) and Nuhfer et al. (2017) argue the human pattern is partly a statistical artifact of regression to the mean, not a clean cognitive bias. That debate doesn’t invalidate the LLM data — model calibration is a direct measurement of confidence versus correctness, not a noisy self-assessment by a confused subject — but the metaphor is a name we’re hanging on a different mechanism.

The simple “cheap equals bad confidence” story has counterexamples. Gemini 2.5 Pro is the most accurate model in the Ghosh and Panday study (80.9%) and yet has near-zero confidence-accuracy correlation (Pearson r = 0.011, p = 0.406, not statistically significant). Its mean confidence is 99.5% with almost no variance — confidence is a constant, not a signal. That is decorative confidence wrapped around mostly-correct answers, so the user pays a smaller tax. The defensible claim is sharper: cheap models combine bad confidence with low accuracy, which is the worst-case combination.

And the fix is real. MIT CSAIL’s RLCR method (arXiv:2410.09724) folds Brier-score penalties into the reward function during training and reports up to 90% reductions in calibration error without sacrificing accuracy. Conformal prediction, post-hoc Platt scaling (which dropped a medical vision-language model’s calibration error from 0.419 to 0.035 in arXiv:2604.02543), and confidence-aware routing all work. The DK Tax is a choice the industry is making, not an inevitability. Budget-tier providers don’t currently invest in calibration-aware training, and most buyers don’t know calibration is a dimension to ask about. The tax exists in the gap between what we know how to fix and what we are paying for.
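Platt scaling in particular is almost embarrassingly small: fit a one-variable logistic regression on a held-out set of (raw confidence, was it right) pairs and use it to re-map confidence at inference time. The sketch below is the generic recipe, not the cited paper's code.

# Post-hoc Platt scaling: learn a monotone mapping from the model's raw
# confidence to the probability it is actually correct, using held-out labels.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_platt(raw_conf, correct):
    """raw_conf: floats in [0, 1]; correct: 0/1 labels from a held-out set."""
    lr = LogisticRegression()
    lr.fit(np.asarray(raw_conf).reshape(-1, 1), np.asarray(correct))
    return lambda c: float(lr.predict_proba([[c]])[0, 1])

# calibrate = fit_platt(dev_confidences, dev_correct)
# calibrate(0.97)   # often far below 0.97 for an overconfident model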

What to Actually Do

For developers and tech leads who deploy LLMs, four moves push back on the tax.

Treat per-token price as one variable in a multivariate cost equation. Add Expected Calibration Error to your model evaluation alongside accuracy, latency, and cost. If your provider doesn’t publish ECE on benchmarks similar to your workload, run it yourself: a 200-question evaluation set with confidence ratings is enough to get a usable estimate, and the script fits in fewer lines than your usual eval rig. Until you measure ECE, the cheap-model “savings” line in your TCO sheet is a guess.
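The harness really is just a loop. In the sketch below, ask_model and is_correct stand in for your own client and grader, the prompt wording is an assumption rather than any provider's API, and the ECE function is the one sketched earlier.

# The 200-question calibration check as a loop. `ask_model` returns the text
# answer plus the model's stated 0-100 confidence; `is_correct` is your grader.
# Both are stand-ins for your own code, and the prompt wording is illustrative.
def measure_ece(questions, gold_answers, ask_model, is_correct):
    confidences, correct = [], []
    for q, gold in zip(questions, gold_answers):
        answer, stated_conf = ask_model(
            f"{q}\nAnswer, then rate your confidence from 0 to 100 on its own line."
        )
        confidences.append(stated_conf / 100.0)
        correct.append(1 if is_correct(answer, gold) else 0)
    return expected_calibration_error(confidences, correct)  # sketched earlier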

Don’t trust the model’s self-reported confidence to trigger review. The verbalized-confidence channel is broken across model tiers — a 2026 survey across GPT-3.5, Vicuna, and others reported ECEs above 0.377 for verbalized confidence, and GPT-4 hit only 62.7% AUROC at discriminating its own correct answers from wrong ones (Beancount.io confidence-calibration survey, 2026). Token probabilities are also miscalibrated (ECE 0.191–0.336 across nine reasoning models in arXiv:2604.19444, April 2026). Build verification on external signals: cross-check citations against authoritative databases with HEAD-request validation, run schema validation at the API boundary instead of coercing the model’s output, route uncertain-by-content queries (long output, novel domain, high-stakes decision) to a stronger model regardless of what the budget model claims about itself.
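One of those external signals is cheap enough to show in full. A HEAD request proves a cited source resolves and nothing more, so failures go straight to a human and successes still get spot-checked; the functions below are illustrative, not a particular citation database's API.

import requests

# External-signal check: does the cited source even resolve? A 200 proves the
# document exists, not that it supports the claim, so this is a floor, not a
# verifier. Constructing the URL from a citation is left to your own resolver.
def citation_resolves(url: str, timeout: float = 5.0) -> bool:
    try:
        resp = requests.head(url, allow_redirects=True, timeout=timeout)
        return resp.status_code < 400
    except requests.RequestException:
        return False

def flag_for_review(citation_urls: list[str]) -> list[str]:
    """Every cited URL that fails the check; a non-empty list means human review."""
    return [u for u in citation_urls if not citation_resolves(u)]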

Never let one LLM verify another LLM’s output as the only safeguard. The Raja Rajan case (April 27, 2026) is the cautionary tale: he used one AI to draft his brief, then a different AI to “verify” the citations. Both were confidently wrong; it was the second offense on his record, and the sanction was $5,000. The arXiv:2508.06225 paper on LLM-as-a-Judge documents the same pattern at benchmark scale: judge models exhibit systematic overconfidence in their evaluations. Two miscalibrated models do not produce a calibrated result; they produce a more elaborate hallucination.

Push your providers on calibration explicitly. Ask: was this model trained with a calibration objective? What is its ECE on benchmarks comparable to your workload? Anthropic’s calibration advantage with Claude Haiku 4.5 in the Ghosh and Panday study is, per the authors, likely a reflection of training methodology that emphasizes honest expression of uncertainty. That is a design choice. The market will start rewarding it as buyers learn to ask the question, and the question forces providers to either show their numbers or admit they do not measure them.

The Receipts

The blunt insight: choosing a model tier is also choosing a calibration regime. The buyers most pressured to choose cheap are the buyers least equipped to absorb the verification burden the cheap model offloads onto them. That is what makes the DK Tax regressive. It is also what makes it tractable, because the moment you recognize calibration as a cost dimension, you can negotiate it like any other cost dimension.

The court dockets, the abandoned enterprise pilots, the IBM Watson for Oncology write-down (~$3 billion in losses, MD Anderson alone spending $62 million before walking away), the 47% of executives in SQ Magazine’s 2026 survey who reported making major decisions on unverified AI output — these are not noise. They are receipts. The model said it knew. It did not. Someone paid.

You can pay a different way.

Verify on External Signals, Not on the Model’s Own Confidence

The essay’s second move, never trusting verbalized confidence to trigger review, maps directly onto the architectural split the Agent Trust Stack ships as composable libraries: signed claims of what an agent did (Chain of Consciousness), portable rebuttable ratings layered on top of those claims (Agent Rating Protocol), and a discipline that ratings are inputs to human judgment, never autopilot for it. The point is to put verification on something the model cannot rewrite by sounding sure of itself.

pip install agent-trust-stack
npm install agent-trust-stack

For the provenance layer specifically — the signed-action chain that gives a rating something to point at — Hosted Chain of Consciousness ships it as a service. Confidence is the failure mode. External signals are the answer.