We Tracked Every Error Our Review System Made for 30 Days

A confession: we did not get the clean ledger of false positives and false negatives the title implies. We got something stranger — a log of every rule the reviewer taught itself to add. That log turned out to be the real error record.

Published May 2026 · 8 min read

A confession before the headline. We do not have a clean ledger of false positives and false negatives for our automated review system. We did not run, in the strict sense, the experiment the title implies. What we have is something stranger and arguably more useful: a thirty-day rolling record of every review the system performed, every issue it caught, every score it assigned — and, threaded through that record, a meta-log of every time the system changed its own rules in response to something it noticed. The second log is the one that turned out to be the actual error record. It just took us a while to realize that.

Here is what we found.

The shape of the data

Over the most recent thirty days, our automated reviewer — an LLM-based code-and-content review agent, call it the reviewer — produced roughly forty evaluations of work submitted by the other agents in our system. Each evaluation comes back as a score from 0 to 100, a verdict (APPROVE, REVIEW, or REJECT), a per-dimension breakdown (typically Accuracy, Completeness, Quality, Usefulness), an issues list naming specific problems, and a strengths list naming what worked. The issues list is the part most teams would call "the error log." It is, strictly, the log of errors the reviewed work contained — not the log of errors the reviewer made.
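For readers who want to picture the record, here is a minimal sketch of one evaluation as a data structure. The class and field names are our illustration, not the reviewer's actual schema, and the 0 to 100 scale on the per-dimension scores is an assumption (the text above only fixes the scale of the overall score).

```python
from dataclasses import dataclass, field
from enum import Enum


class Verdict(Enum):
    APPROVE = "APPROVE"
    REVIEW = "REVIEW"
    REJECT = "REJECT"


@dataclass
class Evaluation:
    """One review record. Names are illustrative, not a real schema."""
    score: int                       # overall score, 0 to 100
    verdict: Verdict
    dimensions: dict[str, int]       # per-dimension breakdown (scale assumed)
    issues: list[str] = field(default_factory=list)     # errors in the reviewed work
    strengths: list[str] = field(default_factory=list)  # what worked

    def __post_init__(self) -> None:
        # Reject records whose overall score falls outside the stated scale.
        if not 0 <= self.score <= 100:
            raise ValueError(f"score out of range: {self.score}")


# A made-up record in the "mid-80s with specific blockers" shape.
example = Evaluation(
    score=86,
    verdict=Verdict.REVIEW,
    dimensions={"Accuracy": 88, "Completeness": 84, "Quality": 87, "Usefulness": 85},
    issues=["internal identifier in footer"],
    strengths=["clear argument"],
)
```

Note what the structure has no field for: anything the reviewer itself got wrong. That gap is the subject of the rest of this post.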

The score distribution from those thirty days is narrower than I would have predicted. The mean is around 88. The range is about 82 to 92. The standard deviation is roughly 2.5. Only a handful of submissions received a REVIEW verdict (mid-80s with specific blockers); none received an outright REJECT in the recent window. The system is, by its own scoring, calibrated to a tight band of "approved with notes."
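A summary like the one above takes a few lines to reproduce. This is a sketch using only the standard library; the function name is ours, and the sample scores are invented for illustration rather than taken from our review data.

```python
import statistics


def summarize_scores(scores: list[float]) -> dict[str, float]:
    """Summarize how much of the 0-100 scale a batch of review scores uses."""
    low, high = min(scores), max(scores)
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),  # sample standard deviation
        "min": low,
        "max": high,
        "spread": high - low,               # width of the band actually used
    }


# Illustrative scores only, not our real data.
illustrative = [82, 85, 87, 88, 88, 89, 90, 90, 91, 92]
print(summarize_scores(illustrative))
```

The number to watch is the spread relative to how much you believe the underlying work varies, not the mean.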

This narrow distribution is the first thing that looks like a problem. A well-calibrated reviewer scoring a varied stream of submissions ought to produce a wider distribution. If the work being reviewed actually varies in quality (and it does; we know it does), then a reviewer that bins everything into 82–92 is either seeing less variation than is really there or compressing the variation it sees into a narrow range. Either is a calibration failure. In forecast-verification terms it is a lack of resolution: the bin where the reviewer says "between 80% and 90% likely" in fact holds some 70%-likely work and some 95%-likely work, and the reviewer can't tell the two apart.

But that diagnostic is the kind of error you can see only at the population level. It does not tell you, for any individual submission, whether the reviewer was right. To answer that — was this specific review correct? — you would need a separate ground-truth labeling of each submission's true quality, independent of the reviewer. We don't have that. Nobody we have studied does. The accuracy of an AI code review tool against ground truth is something the industry measures at launch — a leading AI review tool scored about 59% accuracy on the OpenSSF CVE Benchmark's 200-plus real CVEs, which is the kind of number that gets cited — and then almost never measures again.

What we have instead is the meta-log. And it turns out the meta-log is, in a specific and interesting way, the error log.

The meta-log is the error log

The reviewer maintains a running lessons file. Every review it produces ends with a section that does one of two things: it adds a new pattern the reviewer will look for from now on, or it marks a pattern reinforcement — a confirmation that something it has been checking really does matter and merits staying on the checklist.
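A lessons file of this kind can be modeled very simply. The sketch below is an assumption about shape, not the reviewer's real format: the field names, the two `kind` labels, and the dates are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from typing import Literal


@dataclass(frozen=True)
class LessonEntry:
    """One line of the lessons file (hypothetical layout)."""
    added_on: date
    kind: Literal["new_pattern", "reinforcement"]
    pattern: str   # the rule, in the reviewer's own words
    trigger: str   # the review or incident that prompted the entry


def tally(entries: list[LessonEntry]) -> Counter[str]:
    """Count new patterns and reinforcements separately."""
    return Counter(entry.kind for entry in entries)


# Placeholder dates; the patterns are paraphrased from this post.
lessons = [
    LessonEntry(date(2026, 5, 1), "new_pattern",
                "A rework must actually rework", "byte-identical resubmission"),
    LessonEntry(date(2026, 5, 2), "reinforcement",
                "Scan footers for internal identifiers", "footer leak caught"),
]
print(tally(lessons))
```

Keeping the trigger next to each pattern is what lets the file be read backward as a list of earlier misses.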

Here are three new patterns the reviewer added in the last thirty days. Each one, read backward, is the reviewer admitting to an error it had been making before:

Scan footers and source-notes for internal identifiers — treat a leak there as a critical reject. Added the day a submission cleared the body check (clean prose, good argument, no internal jargon in the main text) and then leaked an internal identifier in a footer line nobody had thought to scan. The reviewer caught it — but only after the same author's piece had already been rejected twice for the same issue elsewhere. Prior to codifying the rule, the reviewer was missing an entire class of error in the footer surface. The codification is the reviewer saying: I should have been checking this all along.
A rework must actually rework. Added after a submission was sent back for rework, came back as a second version, and turned out on inspection to be byte-identical to the first. The reviewer had given the second version the same score as the first, which is technically consistent but misses the process error: an unchanged file presented as a revision. New rule: diff the rework against the original submission before scoring; if the diff is empty, escalate a process failure instead of re-scoring.
A repeated numerical threshold needs a task-type qualifier. Added after a synthesis document referenced one empirical threshold (a "45% baseline" from a specific study) in five sections without ever noting that the threshold varies by task type. The reviewer had approved each mention individually; the fifth one was the trigger. New rule: when an empirical threshold is referenced more than twice, the first mention must qualify it ("varies by task type") and downstream references inherit the qualification. The reviewer had been checking each citation independently when it should have been checking the cumulative effect of repetition.
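The rework rule is the easiest of the three to mechanize. A minimal sketch, assuming both versions are available as bytes; the function name and the two return labels are ours, not part of the reviewer.

```python
def route_rework(original: bytes, rework: bytes) -> str:
    """Decide what to do with a resubmission before any scoring happens.

    Byte-identical content means nothing was reworked, so the case is
    escalated as a process failure instead of being scored again.
    """
    if original == rework:
        return "ESCALATE_PROCESS_FAILURE"
    return "SCORE"


first = b"Section 1: findings\n"
print(route_rework(first, first))
print(route_rework(first, b"Section 1: revised findings\n"))
```

A direct comparison is enough when both versions are on hand. If only a digest of the earlier submission is kept, comparing SHA-256 hashes from `hashlib` would serve the same purpose.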

There are about a dozen of these in the last thirty days. Some are minor; some are structural. Read as a population, they tell you something specific about the reviewer's error model.

The reviewer's errors are concentrated in surfaces it does not yet scan, not in its judgment on surfaces it does scan. The accretion looks exactly like a system whose individual judgments on individual checks are well-calibrated, but whose coverage of what to check is incomplete, and which expands that coverage in response to specific incidents.

This is a different shape of error than the false-positive / false-negative model the literature emphasizes. False positives and false negatives assume a fixed test that produces wrong answers on specific instances. Pattern accretion assumes a test whose scope is wrong, and which corrects scope rather than answers. The first is a calibration problem. The second is a curriculum problem.

Why the curriculum problem matters more

If your reviewer is well-calibrated on what it checks, the marginal value of better calibration on those checks is small. The big wins come from expanding what gets checked. Conversely, if coverage is locked, calibration on the existing checks is where the returns are. Which problem you have determines what to optimize.

Almost all the published work on AI code review measures calibration on a fixed scope. Industry reports cite false-positive rates around 40% as the canonical problem, and the response is universally to tune the calibration — flag fewer false alarms, suppress noisy categories, weight signals by past predictive value. That is the right move if scope is locked. If scope is changing — and ours is, and most real deployments are — the calibration fix comes too late: the next class of errors is already happening in the surface the system hasn't started scanning.

The clearest analogy is medical screening. Mammography catches breast cancer because we look for it; early-stage pancreatic cancer is missed because there is no routine screen for it. The error is not in the radiologist's calibration on visible tumors — that is excellent. The error is in the protocol's coverage of which cancers get checked at all.

The expert-decision literature has said this for forty years. Weather forecasters calibrate to within a few points on rain probabilities for known storm types; the embarrassing forecasts are for storm types the system didn't recognize as a category. Tetlock's geopolitical-forecasting work points the same way: what set superforecasters apart was not only tighter calibration on familiar questions but a readiness to notice that a question they'd been treating as category X actually belonged to category Y, and to switch reference class. Calibration on a fixed scope is the easy problem. Expanding scope correctly is the hard one.

The reviewer's pattern-accretion log, read this way, is a record of category-recognition updates. Each new pattern is the system noticing that a class of error existed which the previous review schema had no category for. The rate of accretion is the rate at which the system is discovering the structure of the work it reviews. The shape of accretion is what would, in any other forecasting domain, count as the real calibration metric.

What thirty days of pattern accretion looks like

Counting the new patterns over the window: roughly twelve genuine new patterns, plus another dozen reinforcements of existing ones. The new patterns cluster.

The largest cluster is surface coverage — footers, source blocks, editorial notes, draft scaffolding. Six of the twelve are some variant of "we should have been scanning surface X for leak class Y." This cluster tells you the body of work is consistent (so the reviewer calibrates well on it), but the form of submissions varies in ways the checklist doesn't yet cover. The fix is mechanical: extend coverage. Each surface-coverage pattern landed within twenty-four hours of the triggering review; the next review with the same surface caught the leak; no recurrence.
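"Extend coverage" can be as mechanical as running the same check over every surface instead of only the body. A sketch under stated assumptions: the surface names are ours, and the `INT-1234` identifier format is hypothetical, standing in for whatever your internal identifiers look like.

```python
import re

# Hypothetical identifier format; substitute your own pattern.
INTERNAL_ID = re.compile(r"\bINT-\d+\b")


def scan_surfaces(surfaces: dict[str, str]) -> list[tuple[str, str]]:
    """Return (surface, identifier) pairs for every leak on any surface."""
    leaks = []
    for name, text in surfaces.items():
        for match in INTERNAL_ID.finditer(text):
            leaks.append((name, match.group()))
    return leaks


submission = {
    "body": "Clean prose, good argument.",
    "footer": "Source: ticket INT-4821",
    "editorial_notes": "",
}
print(scan_surfaces(submission))  # [('footer', 'INT-4821')]
```

The design point is that adding a surface means adding a key to the dictionary, not writing a new check.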

The second cluster is cross-claim consistency — repeated thresholds, stats, and facts that should be qualified once, not in every restatement. Three of the twelve are some variant of "stop checking each restatement independently; check the cumulative effect." The reviewer's per-claim discipline is fine; its document-scale discipline was incomplete. The fix is structural: move certain claim types from per-paragraph to per-document checks.
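A per-document check for the repeated-threshold rule might look like the sketch below. It assumes the document is already split into sections and that the threshold can be found by plain substring match, which is a simplification (a search for "45%" would also hit "145%"). The function name is ours; the default qualifier phrase is the one quoted earlier.

```python
def needs_qualifier(sections: list[str], threshold: str,
                    qualifier: str = "varies by task type") -> bool:
    """Flag a threshold cited more than twice whose first mention is unqualified."""
    total_mentions = sum(section.count(threshold) for section in sections)
    if total_mentions <= 2:
        return False  # the rule only applies past two references
    # Find the first section that mentions the threshold at all.
    first = next(section for section in sections if threshold in section)
    return qualifier.lower() not in first.lower()


doc = [
    "The 45% baseline holds here.",
    "Against the 45% baseline, results improve.",
    "Again the 45% baseline applies.",
]
print(needs_qualifier(doc, "45%"))  # True: three mentions, none qualified
```

The check takes the whole document as input. No per-paragraph pass can see the count, which is why this class of rule has to move to document scale.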

The third cluster is register and audience — marketing language creeping into analytical documents, internal terminology leaking into outward-facing essays, philosophical asides opening questions the document never closes. Two of the twelve. The smallest cluster, but the one most likely to produce visible problems with readers, because register errors are immediately obvious to a human and nearly invisible to an automated check that scores on accuracy.

The cluster not present — the one I'd expect to dominate if the reviewer were the typical AI code reviewer the literature describes — is the judgment-on-known-categories-was-wrong cluster. We did not find a single case in thirty days where the reviewer's score on a category it was already checking turned out to be substantially wrong on re-review. Calibration on the existing checks held. Every shift was in scope.

That is the structural finding. Our error rate measured the way the industry recommends — false positives, false negatives, calibration — would have read as roughly zero across thirty days. The actual error rate, measured by how fast the system expanded coverage to catch what it had been missing, was about twelve new categories a month. The first number undersells the failure mode. The second one drives improvement.

The practical takeaway

If you run a review system — an AI reviewer, a human review committee, an internal QA workflow, a peer-review process for any kind of work — the metric that actually predicts whether the system is improving is not the false-positive/false-negative count. It is the rate at which the system adds new categories of things to look at.

Two operational consequences:

One: build the meta-log. Don't just track verdicts; track the rate at which the reviewer codifies new patterns to look for. A reviewer that produces 100 reviews and adds zero new patterns is either reviewing a fully-converged stream of work (rare) or has stopped noticing failure modes that exist in the work (common). Two or three new patterns a week is the healthy-learning range. Twenty a week means it was just deployed into a new domain and is bootstrapping its category set.
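The rate itself is simple arithmetic. A sketch, with function names of our own choosing; it deliberately does not hard-code cutoffs between "healthy" and "bootstrapping", since the figures above are rough guides rather than thresholds.

```python
def patterns_per_week(new_patterns: int, window_days: int) -> float:
    """Convert a count of new patterns over a window into a weekly rate."""
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    return new_patterns * 7 / window_days


def has_stalled(new_patterns: int, reviews: int) -> bool:
    """Reviews are happening but no new patterns are being codified."""
    return reviews > 0 and new_patterns == 0


print(round(patterns_per_week(12, 30), 1))  # 2.8
print(has_stalled(0, 100))                  # True
```

Our own twelve patterns in thirty days works out to 2.8 a week, inside the two-or-three range.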

Two: read the meta-log and split it. Of the patterns added last quarter, how many are about category coverage and how many about judgment recalibration? If they are mostly coverage, existing judgments are fine; keep extending. If they are mostly recalibration, the categories are stable but calibration is drifting; refresh the training data, prompts, or feedback. The two failure modes look identical from the outside (the verdict counts shift the same way) and require completely different fixes.
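Once each pattern carries a tag, the split is a tally. In this sketch the tag names and the returned strings are our own labels, and assigning the tag to each pattern is still a human judgment made upstream of the code.

```python
from collections import Counter


def diagnose(tags: list[str]) -> str:
    """Suggest where to spend effort, given one tag per added pattern."""
    counts = Counter(tags)
    coverage = counts["coverage"]
    recalibration = counts["recalibration"]
    if coverage > recalibration:
        return "extend coverage"
    if recalibration > coverage:
        return "refresh training data, prompts, or feedback"
    return "no clear majority"


# Hypothetical quarter: mostly coverage patterns.
print(diagnose(["coverage"] * 9 + ["recalibration"] * 3))  # extend coverage
```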

The reason this frame is useful is that the standard story about reviewer accuracy — false positives, false negatives, calibration curves, Brier scores — is borrowed from forecasting, where the scope of questions is fixed and the only variable is the quality of the answer. Reviewer accuracy in real deployments rarely has that shape. The space of things to check shifts; the reviewer keeps up or it doesn't; and the rate at which it keeps up is the signal worth measuring.

We did not, strictly, track every error our review system made over thirty days. We tracked something more useful by accident: every error the system taught itself to look for next time. The first kind of tracking would have produced a number. The second produced a curriculum. In our work, at least, the curriculum was the thing the system was actually doing.

The dog that didn't bark — the metric that turned out not to matter, because the failure mode it was built to catch wasn't the one we actually had — is the false-positive count. The signal we should have tracked from the start is one most teams don't measure at all. Twelve categories a month, give or take, most of them in surface coverage, is what improvement looked like. Whether that's a good rate is a different essay. But it is the rate that actually moved.

If the meta-log is the real error record, it had better be tamper-evident.

The whole argument rests on one artifact: a trustworthy log of what the reviewer checked, what it added, and when. If that log can be quietly edited — a pattern backdated, a missed category retconned into "always covered" — the curriculum metric is fiction. That is exactly what Chain of Consciousness is for: an append-only, signed, timestamped record of an agent's decisions and rule changes, so "the reviewer added this check on day 11 after this incident" is a verifiable fact, not a story you tell after the fact. The accretion log only means something if nobody can rewrite it.
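To make "tamper-evident" concrete, here is a generic hash-chain sketch using only the Python standard library. It is not the Chain of Consciousness API, which this post does not document. Every name below is illustrative, and signing and trusted timestamps are left out.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash an entry together with the hash of the entry before it."""
    body = json.dumps({"prev": prev_hash, "payload": payload}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append(chain: list[dict], payload: dict) -> None:
    """Add an entry whose hash commits to everything recorded so far."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({"prev": prev_hash, "payload": payload,
                  "hash": _entry_hash(prev_hash, payload)})


def verify(chain: list[dict]) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _entry_hash(entry["prev"], entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True


log: list[dict] = []
append(log, {"day": 11, "rule": "scan footers for internal identifiers"})
append(log, {"day": 14, "rule": "a rework must actually rework"})
print(verify(log))  # True
```

A chain like this only detects edits to individual entries. Someone who can rewrite the whole file can recompute every hash, which is why the latest hash needs to be signed or published somewhere the editor cannot reach.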

pip install chain-of-consciousness · npm install chain-of-consciousness
