Bias is when the dart lands consistently left of the bullseye. Noise is when the darts land all over the board. One of these has lawyers. The other has your performance review.
In a French courtroom a few years ago, a judge handed down a lighter sentence than the case warranted. The defendant had no idea why. The defendant did know one thing about the day, though: it was their birthday.
This is not folklore. In their 2021 book Noise, Daniel Kahneman, Olivier Sibony, and Cass Sunstein pulled out the data: French court judges showed measurable leniency to defendants whose hearings fell on the defendant’s birthday. Not the judge’s birthday. The defendant’s. A fact with zero relevance to guilt, evidence, or law shifted years of someone’s life.
If that bothers you, here is the worse news: it almost certainly happened to you last week. You got your performance review, and roughly three-quarters of the rating had nothing to do with your performance. Olivier Sibony, in a 2021 McKinsey interview promoting the book, summed up the research bluntly: only about a quarter of a typical performance rating tracks actual performance. The rest is the rater’s mood, the rater’s idiosyncratic preferences, the order they reviewed people in, the weather outside the window.
We have built sophisticated machinery to detect bias in our decision systems — racial bias in lending, gender bias in hiring, cognitive bias in forecasting. We have done almost nothing about the larger error source sitting next to it. It finally has a name, but most of the institutions footing the bill still do not use the word.
The name is noise. And it is the hidden tax on every decision system you participate in.
Back to the dartboard. Bias is the dart landing consistently left of the bullseye; noise is the darts scattering all over the board. Statistically, total error decomposes as Mean Squared Error = Bias² + Noise², where Noise is the standard deviation of the judgments around their own mean. For most professional decisions Kahneman’s team measured, the noise term dominated.
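If you would rather see the decomposition than take it on faith, here is a minimal simulation (all numbers invented): squared average error plus error variance reproduces the mean squared error exactly, and a modest scatter swamps a sizable bias.

```python
import numpy as np

rng = np.random.default_rng(0)

true_value = 100.0   # the bullseye
bias = -8.0          # every judgment lands, on average, 8 units low
noise_sd = 15.0      # scatter of judgments around their own mean

judgments = true_value + bias + rng.normal(0.0, noise_sd, size=100_000)
errors = judgments - true_value

mse = np.mean(errors ** 2)
bias_sq = np.mean(errors) ** 2   # squared average error
noise_sq = np.var(errors)        # variance of the errors

print(f"MSE              = {mse:8.1f}")
print(f"Bias^2           = {bias_sq:8.1f}")              # ~ 64
print(f"Noise^2          = {noise_sq:8.1f}")             # ~ 225
print(f"Bias^2 + Noise^2 = {bias_sq + noise_sq:8.1f}")   # equals MSE
```

Even with a bias of eight full units, the noise term here is more than three times larger. That ratio, not the identity itself, is what the audits keep finding.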
There are three flavors. Level noise is when one judge is consistently harsh and another consistently lenient — the same case gets very different verdicts depending on whose courtroom it lands in. Pattern noise is when judges rank cases differently — one is harder on white-collar crime, another harder on street crime. Occasion noise is when the same judge gives the same case different verdicts on different days, depending on what they ate, who won the football game last night, and apparently whether the defendant’s birthday cake was on the table.
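The three flavors are separable in data, not just in prose. Here is a toy decomposition, assuming the luxury of every judge rating every case on two occasions (real audits rarely get this, which is part of why noise hides); all the variances are invented:

```python
import numpy as np

rng = np.random.default_rng(1)
n_judges, n_cases, n_occasions = 10, 50, 2

# Simulated ratings = baseline + case difficulty
#   + judge severity (level) + judge-x-case idiosyncrasy (pattern)
#   + day-to-day jitter (occasion). True variances: 4.0, 2.25, 1.0.
case_eff = rng.normal(0, 3.0, size=(1, n_cases, 1))
level    = rng.normal(0, 2.0, size=(n_judges, 1, 1))
pattern  = rng.normal(0, 1.5, size=(n_judges, n_cases, 1))
occasion = rng.normal(0, 1.0, size=(n_judges, n_cases, n_occasions))
ratings  = 10 + case_eff + level + pattern + occasion

# Occasion noise: variance across occasions within each judge-case cell.
occ_var = ratings.var(axis=2, ddof=1).mean()

# Level noise: variance of the judges' overall means.
cell = ratings.mean(axis=2)            # occasion averaged out
judge_means = cell.mean(axis=1)
level_var = judge_means.var(ddof=1)

# Pattern noise: the judge-x-case interaction left after removing both
# main effects, minus the sliver of occasion noise that survives averaging.
resid = cell - cell.mean(axis=0) - judge_means[:, None] + cell.mean()
pattern_var = resid.var(ddof=1) - occ_var / n_occasions

print(f"level noise^2    ~ {level_var:.2f}  (true 4.00)")
print(f"pattern noise^2  ~ {pattern_var:.2f}  (true 2.25)")
print(f"occasion noise^2 ~ {occ_var:.2f}  (true 1.00)")
```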
Hold that taxonomy. Most of the rest of this essay is about how every part of modern life is dripping in all three.
In a now-famous study, Kahneman and Sibony’s team got an insurance company to run a noise audit. The company’s executives expected underwriters reviewing identical case files to disagree by about 10%. Underwriters, after all, are trained, certified, and follow detailed procedures. The actual number was 55%. In the book’s own illustration, one underwriter quotes $9,500 on a case where another quotes $16,700: a gap equal to 55% of their average. The insurance company had no idea.
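The metric behind that 55% is worth seeing. A sketch of the noise index as the book describes it, the expected difference between two randomly chosen judgments of the same case as a fraction of their average ($9,500 and $16,700 are the book’s illustrative pair; the other three quotes are invented):

```python
import itertools

def noise_index(judgments):
    """Mean of |a - b| / mean(a, b) over every pair of judgments
    on the same case (the book's noise-audit metric)."""
    pairs = list(itertools.combinations(judgments, 2))
    ratios = [abs(a - b) / ((a + b) / 2) for a, b in pairs]
    return sum(ratios) / len(ratios)

# Five underwriters quote the same case.
quotes = [9_500, 16_700, 12_000, 10_400, 14_100]
print(f"noise index: {noise_index(quotes):.0%}")  # ~25% for these five
```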
A separate audit at an investment firm found 44% variance between analysts evaluating the same company. Asylum cases in U.S. immigration courts, per TRAC at Syracuse, swing from a 1.3% denial rate with one judge to a 100% denial rate with another in the same building. Wine — which an entire industry exists to evaluate — is essentially scored at random: Robert Hodgson’s studies of U.S. wine competitions showed that the gold medals a wine won at one major competition were statistically independent of the medals it won at another. The medal on the bottle is, mathematically, signal-free.
Forensic fingerprint examiners in the FBI–Noblis study correctly matched prints in only 62.6% of trials. Pathologists agree on a diagnosis about 62% of the time; psychiatrists, 50%. Software engineers, asked to estimate the same task on two different days, give answers that differ from their own earlier estimate by an average of 71%.
Now restate each of those as a tax. If your insurance premium were a restaurant bill, noise would be a 55% surcharge you didn’t know you were paying — sometimes in your favor, sometimes against you, never itemized. If you got a software estimate of three weeks, the same engineer might have given you five on a different day. If you went to the doctor with chest pain, whether you get a cardiac workup or a Xanax prescription depends on which physician was on shift and what kind of morning they had. The mean answer is fine. The individual answer is a coin flip.
This is the hidden surcharge. It is structural, ongoing, and built into the price of everything.
Bias is visible because you can spot the pattern. If lending denials cluster by zip code, the disparity shows up in aggregate data and someone files a lawsuit. Noise has no pattern. It’s just more variance than there should be. And critically: each individual case goes to one decider. You don’t know what the other underwriter would have quoted. You don’t know what the other judge would have given. You see your single answer and assume it was the answer.
The institutions hurt by noise don’t even have stable vocabulary for it. A 2025 scoping review in BMC Medical Informatics and Decision Making — titled, with unintentionally brutal honesty, “Noise is an underrecognized problem in medical decision making and is known by other names” — found that medical literature hides noise under “inter-rater reliability,” “intra-rater reliability,” “random variability,” and “practice variation.” Of fourteen studies meeting the inclusion criteria, seven demonstrated pattern noise and three demonstrated occasion noise, but none of the underlying papers used the unified word. You cannot fix what you cannot name. The variance remains, fragmented across journal silos.
The same year, the British Psychological Society ran a piece titled “Psychology needs a noise revolution.” An entire field dedicated to studying the human mind had spent a century focused on the signal — mean effects, headline findings, big-N replications — and treating the variance as nuisance to be averaged away. The rest of us were never going to be ahead of the people who study cognition for a living.
Step back from the professionals for a moment. Consider the children.
Casey Family Programs, working from Milwaukee County data, tracked what happened to children in the child welfare system based on how many caseworkers handled their case. A child with one consistent caseworker reached a permanent placement 74.5% of the time. With two caseworkers, 17.5%. With six or seven caseworkers, 0.1%.
That is not a typo. One in a thousand.
Each caseworker handoff introduces a new decision-maker with new pattern noise — a different read on the file, a different sense of which factors matter, a different threshold for action. The child’s life becomes a game of telephone. By the seventh telling, the original signal is gone. The variance compounds because it is serial: each handoff adds noise to whatever noise already entered the system, and the institution running the system has no instrument that measures the compounded total.
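A back-of-envelope way to feel the compounding, with invented parameters rather than Milwaukee’s: each handoff adds an independent layer of read noise to the file, and the chance that the final record still tracks the child’s actual situation falls monotonically.

```python
import numpy as np

rng = np.random.default_rng(2)

def permanency_rate(n_caseworkers, trials=100_000,
                    read_noise_sd=0.6, threshold=1.0):
    """Game-of-telephone model: each caseworker re-reads the file and
    writes it back with independent interpretation noise. A permanent
    placement happens only if the final record still sits within
    `threshold` of the child's true situation. All parameters invented."""
    record = np.zeros(trials)   # the record starts at the truth
    for _ in range(n_caseworkers):
        record += rng.normal(0.0, read_noise_sd, size=trials)
    return np.mean(np.abs(record) < threshold)

for n in (1, 2, 4, 7):
    print(f"{n} caseworker(s): {permanency_rate(n):.1%}")
```

This Gaussian toy decays far more gently than the Milwaukee numbers, which should worry you more, not less: real handoffs don’t just jitter the file, they lose the previous worker’s reasoning entirely.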
This is what the “tax” metaphor is really about. In foster care, noise isn’t a margin error on a spreadsheet. It’s a child who never gets adopted because too many people, none of them malicious, none of them incompetent, each made slightly different judgments about a file that was slightly different from what the previous caseworker wrote down. The variability isn’t between professionals as a population. It’s between sequential readings of the same case, and there is no audit trail because each reading was, individually, defensible.
If this doesn’t make you want to redesign every handoff in your organization, nothing will.
Here is the obvious response: replace the humans with an algorithm. Algorithms have no birthdays, no moods, no breakfast. The same input produces the same output, every time. Occasion noise drops to zero by construction.
This is true. It is also where most engineering teams stop reading.
The full picture is uglier. When a human underwriter is wrong, they are idiosyncratically wrong, and the next underwriter probably gets it right. When the algorithm is wrong, it is wrong for everyone simultaneously and for the same reason. You haven’t eliminated the error — you have correlated it. Distributed, varied human noise becomes systematic, monoculture failure. The fingerprint-examiner data matters here: those FBI–Noblis examiners disagreed with each other and with themselves, but they also caught different mistakes. Replace them with a single model and a class of errors becomes universal across every population the model touches.
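The difference between scattered error and correlated error is easy to simulate (error rates invented): ten noisy humans disagree their way to a decent majority vote, while a deterministic model is wrong on the same cases every single time, so asking it twice tells you nothing.

```python
import numpy as np

rng = np.random.default_rng(3)
n_cases = 100_000
truth = rng.integers(0, 2, size=n_cases)

# Ten humans, each wrong independently on ~30% of cases.
human_wrong = rng.random((10, n_cases)) < 0.30
votes = np.where(human_wrong, 1 - truth, truth)
majority = (votes.sum(axis=0) > 5).astype(int)

# One model, wrong on a fixed 15% slice of cases: deterministic,
# identical for every user, unchanged if you ask again.
model_wrong = rng.random(n_cases) < 0.15
model = np.where(model_wrong, 1 - truth, truth)

print(f"single human:    {human_wrong[0].mean():.1%} wrong")       # ~30%
print(f"majority of ten: {(majority != truth).mean():.1%} wrong")  # ~5%
print(f"the model:       {(model != truth).mean():.1%} wrong")     # ~15%
# The humans' second opinions catch errors; the model's second
# opinion *is* the first opinion. Its 15% is the same 15% for everyone.
```

The model looks better than any single human on average, and it is worse in exactly the way the paragraph above describes: its errors never cancel, because they are all the same error.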
There is a second issue, freshly relevant in 2025. Large language models are themselves noisy decision-makers. Run the same prompt twice and you’ll get different completions. Use a model as a judge against a fixed rubric and it scores the same output differently across runs. The Kahneman–Sibony framework applies recursively to the systems we are now building to escape it. We are not, with current architectures, replacing noisy human judgment with deterministic machinery. We are replacing noisy human judgment with noisy machine judgment that fails in correlated ways.
Algorithms are a tool, not an absolution. The right question isn’t “human or model?” It’s “what does this decision’s noise distribution actually look like, and which substrate gives me a noise distribution I can live with?”
In May 2025, Perspectives on Psychological Science published a paper by Adam Sanborn at Warwick and colleagues titled “Noise in Cognition: Bug or Feature?” The argument is uncomfortable for anyone who likes the tax metaphor too much.
The brain, Sanborn’s group argues, performs probabilistic inference using a local sampling algorithm. To handle a world too complex for exact computation, cognition uses randomness as a search procedure — generating noisy hypotheses, sampling around them, settling on plausible ones. Cognitive noise, in this view, is not the system’s malfunction. It is the system’s method. Critically, they find that sensory noise and motor-response noise are minor; the bulk of variability arises in the cognitive computations themselves, and it is non-Gaussian, heavy-tailed, and autocorrelated over time. It looks like a random process because it is one — but a structured, useful one. The authors go further: noise is “an essential feature that underpins our ability to deal with an uncertain world.”
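To make “randomness as a search procedure” concrete, here is the generic local-sampling idea the framework builds on, a Metropolis-style random walk over hypotheses. This is an illustration of the class of algorithm, not Sanborn’s actual model; the plausibility landscape is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)

def log_plausibility(h):
    """Stand-in for how well hypothesis h fits the evidence:
    an arbitrary two-bump landscape, purely for illustration."""
    return np.logaddexp(-0.5 * ((h - 2.0) / 0.5) ** 2,
                        -0.5 * ((h + 1.0) / 0.8) ** 2)

# Local sampling: propose a small random step, keep it if the move
# looks plausible enough. The randomness *is* the search procedure.
h, trace = 0.0, []
for _ in range(20_000):
    proposal = h + rng.normal(0.0, 0.4)
    if np.log(rng.random()) < log_plausibility(proposal) - log_plausibility(h):
        h = proposal
    trace.append(h)

trace = np.array(trace)
# The walk divides its time between the two plausible regions...
print(f"time near h=2: {np.mean(trace > 0.5):.0%}, "
      f"time near h=-1: {np.mean(trace < 0.5):.0%}")
# ...and successive samples are autocorrelated, the signature
# Sanborn's group reports in human cognitive noise.
print(f"lag-1 autocorrelation: {np.corrcoef(trace[:-1], trace[1:])[0, 1]:.2f}")
```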
If they are right — and the evidence is now substantial — then a portion of professional pattern noise is exploration in disguise. When two thoughtful underwriters weight the same risk factors differently, they aren’t both broken; they’re each sampling a different region of hypothesis space, and over a population of cases the divergence may catch errors a single homogenized model would miss.
But notice what this does not excuse. Pattern noise might be exploration. Occasion noise — the same underwriter giving different answers on different days to the same file — cannot be exploration in any useful sense. The same person can’t be productively sampling around their own previous sample five minutes ago. The judges’ birthday effect isn’t search; it’s leakage. The 75% noise in your performance review is not your manager doing Bayesian inference. It’s contamination.
The cognitive science doesn’t dissolve the tax. It clarifies it. Some of the variance you see is the engine working. Some of it is sand in the gears. The job is telling them apart.
A decade of noise audits has converged on a small set of practices that consistently work — call them, in Kahneman’s phrase, decision hygiene. They are unglamorous and they help.
Independent judgments first, discussion second. When five people give estimates after one person has already spoken, you have one estimate and four anchored variations. Get every judgment in writing, separately, before the room talks. The “Estimate-Talk-Estimate” protocol used in Delphi forecasting comes from this insight.
Decompose the judgment. Instead of “what’s this case worth?” — score the case on five specific dimensions, weight them explicitly, then aggregate. This converts pattern noise (different intuitive weightings) into a parameter you can audit and, if you want, fix.
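What decomposition looks like in practice is almost embarrassingly simple. A sketch with placeholder dimensions and weights for an underwriting-style judgment; the point is that the weighting now lives in a file you can diff, not in someone’s intuition:

```python
# Explicit, auditable weights: pattern noise turned into a parameter.
# Dimension names and weights are illustrative placeholders.
WEIGHTS = {
    "financial_exposure": 0.35,
    "claims_history":     0.25,
    "industry_risk":      0.20,
    "coverage_scope":     0.15,
    "data_quality":       0.05,
}

def structured_score(scores: dict[str, float]) -> float:
    """Aggregate per-dimension scores (each 0-10) under fixed public
    weights, instead of asking for one intuitive overall number."""
    assert set(scores) == set(WEIGHTS), "score every dimension, no extras"
    return sum(WEIGHTS[d] * s for d, s in scores.items())

case = {
    "financial_exposure": 7.0,
    "claims_history":     4.5,
    "industry_risk":      6.0,
    "coverage_scope":     5.5,
    "data_quality":       8.0,
}
print(f"structured score: {structured_score(case):.2f} / 10")  # 6.00
```

Two raters can now disagree only inside a dimension, and the disagreement has an address.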
Use a relative scale, not an absolute one. Humans are dramatically more consistent ranking pairs (“is A worse than B?”) than scoring in isolation (“rate A from 1 to 100”). Wherever the decision allows, switch to comparison.
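Aggregating comparisons back into a ranking requires nothing fancy. A sketch using plain win counts, with invented case names; Bradley–Terry or Elo are the heavier-duty versions of the same move:

```python
from collections import defaultdict

# Each pair means "the rater judged the first case worse than the second".
# The input is comparisons, never 1-100 absolute scores.
comparisons = [
    ("case_A", "case_B"),
    ("case_C", "case_A"),
    ("case_C", "case_B"),
    ("case_D", "case_C"),
    ("case_D", "case_A"),
    ("case_B", "case_D"),  # raters can disagree; the counts absorb it
]

wins = defaultdict(int)
for worse, better in comparisons:
    wins[better] += 1
    wins[worse] += 0       # ensure every case appears in the tally

for case in sorted(wins, key=wins.get, reverse=True):
    print(f"{case}: preferred in {wins[case]} of its comparisons")
```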
Delay intuition. Get the data, structure the comparisons, then let the gut speak. Intuition that arrives last is calibrated against evidence; intuition that arrives first becomes the evidence.
Actually run the audit. Pick twenty cases, hand them to two independent decision-makers, compute the variance, and look at the number with the executives in the room. Every organization that has done this has been shocked by the result. They predicted 10%. They found 50%. The shock is the point — it produces the political will the other four practices need.
And one more, lifted from a separate 2025 paper by Jens Sundh and colleagues at Uppsala: read the shape of your noise. Sundh’s group showed that noise distributions distinguish analytic reasoning (tight, near-Gaussian) from intuitive reasoning (heavy-tailed, lumpy). If you audit your noise and it has fat tails, you are not just inconsistent — you have decision-makers winging it on instinct in places you assumed they were calculating. The variance isn’t only a number to drive down. It’s a diagnostic readout of how the work is actually being done.
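The shape check is nearly a one-liner once you have a pile of repeated judgments. A sketch using excess kurtosis as a crude fat-tail detector; Sundh’s group fits full distributional models, this just raises the flag:

```python
import numpy as np

def excess_kurtosis(errors):
    """Fourth standardized moment minus 3: roughly 0 for Gaussian
    noise, large and positive for heavy-tailed noise."""
    z = (errors - errors.mean()) / errors.std()
    return float(np.mean(z ** 4) - 3.0)

rng = np.random.default_rng(5)
analytic  = rng.normal(0.0, 1.0, size=5_000)   # tight, near-Gaussian
intuitive = rng.standard_t(df=5, size=5_000)   # heavy-tailed (excess ~6)

print(f"analytic-style noise:  excess kurtosis {excess_kurtosis(analytic):+.2f}")
print(f"intuitive-style noise: excess kurtosis {excess_kurtosis(intuitive):+.2f}")
```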
Noise is the error source nobody itemizes. It costs insurance customers, foster children, criminal defendants, software schedules, and the three-quarters of your performance review that weren’t really about you. It hides in plain sight because each case has only one decider, because the institutions hurt by it lack the vocabulary to name it, and because — newly, troublingly — some of it is the thinking itself.
You will not eliminate it. The 2025 cognitive science suggests the engine of judgment runs on the stuff. But you can do three concrete things this week. Run a noise audit on your most important repeated decision. Replace at least one solo judgment in your workflow with two independent judgments aggregated. And the next time someone tells you that “professional discretion” is the same thing as “professional accuracy,” ask them when they last measured the gap.
The tax is going to keep coming out of every paycheck regardless. The only question is whether you start reading the receipt.