A 3-file pipeline, 14 bugs, and the question that catches what code review misses.
At 9:30 on the morning of August 1, 2012, the New York Stock Exchange opened for trading and Knight Capital’s automated system began buying and selling shares. Forty-five minutes later, the firm had accumulated positions in 154 stocks totaling 397 million shares and $7.65 billion in exposure. By the time someone pulled the plug at 10:15, Knight Capital had lost $440 million — roughly $10 million per minute.
The cause was not a sophisticated exploit or a hardware failure. It was a reused bit flag.
When Knight Capital’s engineers deployed their new Retail Liquidity Program, one of ten servers failed to receive the update. The deployment script “would fail silently, continue to update the other machines, and report success,” according to the SEC’s administrative proceeding. That un-updated server still contained Power Peg, a deprecated order type from the early 2000s that had been abandoned but never removed from the codebase. The new code reused a flag that Power Peg had previously claimed. When the stale server received orders with the new flag set, it interpreted them as Power Peg orders — and Power Peg’s cumulative tracking was broken because the reporting code had been disconnected years earlier.
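The flag-reuse failure mode can be sketched in miniature. Everything below is hypothetical — the names, bit values, and dispatch functions are invented for illustration, not taken from Knight's system — but it shows the mechanism: a bit position freed by dead code is silently reclaimed, so a server that missed the deployment routes new-style orders into the deprecated path.

```python
# Hypothetical sketch of a reused bit flag (all names and values invented).
POWER_PEG = 0x01         # deprecated flag, dead since the early 2000s
RETAIL_LIQUIDITY = 0x01  # years later, the same bit is reclaimed

def dispatch_updated(order_flags):
    """A server that received the deployment: bit 0 means the new program."""
    if order_flags & RETAIL_LIQUIDITY:
        return "retail_liquidity"
    return "normal"

def dispatch_stale(order_flags):
    """A server the deployment script skipped: bit 0 still means Power Peg."""
    if order_flags & POWER_PEG:
        return "power_peg"  # deprecated path with broken position tracking
    return "normal"
```

The same order, with the same bits set, means two different things depending on which server sees it — and each server is locally correct given its own dispatch table.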
Each component was locally correct. The deployment script worked on every reachable server. The new Retail Liquidity code was correct. The old Power Peg code was correct for its era. The system failed because the assumptions between components were wrong, and no component checked whether the others were in the expected state.
This is a story about the bugs that live between files.
Software engineer Capers Jones has spent decades measuring how well different methods catch bugs before they ship. His data on defect removal efficiency tells a consistent story: unit testing catches roughly 25% of defects. Integration testing catches roughly 45%. Formal inspections average about 85% for within-file defects. But most forms of testing are “less than 50% efficient,” and testing overall is “only about 35% efficient, or finding only one bug out of three.”
The U.S. average across all methods combined is approximately 85%. Combining pre-test inspections, static analysis, and at least eight test stages can push that to 99.65% — but any single method tops out around 85%.
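Why stacking methods helps can be shown with a toy model. If each stage caught an independent fraction of the defects that reach it (an idealized assumption — real methods overlap, which is why actual combined numbers fall short of this), the combined efficiency would be one minus the product of the escape rates:

```python
def combined_dre(*efficiencies):
    """Combined defect-removal efficiency under the idealized assumption
    that each stage independently catches its fraction of the defects
    that escaped the previous stages."""
    escape = 1.0
    for e in efficiencies:
        escape *= (1.0 - e)  # fraction of defects that slip past this stage
    return 1.0 - escape

# Jones's rough per-method numbers from the text:
# unit testing 25%, integration testing 45%, formal inspection 85%.
combined = combined_dre(0.25, 0.45, 0.85)  # ≈ 0.938
```

Even three mediocre-to-good filters in series beat any single one — and the model also shows the ceiling: no amount of stacking within-file methods catches a defect class that every stage misses at the same rate.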
That remaining 15% is where the expensive bugs live. Knight Capital’s $440 million. The Ariane 5 rocket that self-destructed 37 seconds after launch on June 4, 1996, destroying its payload of four satellites worth over $370 million. The Therac-25 radiation therapy machine that killed patients between 1985 and 1987 because a race condition between its user-interface and beam-control modules allowed lethal radiation doses when an operator typed fast enough — two modules sharing state without locking, structurally identical to a read-modify-write gap in a message queue.
The bugs that survive every individual review method are the ones that don’t exist in any individual file. They emerge at the seams.
Here is a message processing pipeline split across three files. message_queue.py handles storage and retrieval — messages serialized as JSON to disk, dequeued by scanning for pending status, marked complete or failed with retry. processor.py handles message-specific logic — a handler registry dispatches by type, a dedup set prevents double-processing, email messages get sanitized, webhooks require HTTPS. worker.py ties them together — poll the queue, process what you find, handle exceptions, drain gracefully on shutdown.
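The article does not reproduce the three files, so here is a compressed, hypothetical sketch of the two classes the later paragraphs keep referring to. The attribute names (`_lock_file`, `processed_ids`) come from the text; every other name and detail is an assumption made for illustration, with the relevant unwritten contracts marked in comments.

```python
import json
import uuid
from pathlib import Path

class MessageQueue:
    """Sketch of message_queue.py: messages serialized as JSON to disk,
    dequeued by scanning for pending status."""
    def __init__(self, directory):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)
        self._lock_file = self.directory / "queue.lock"  # created, never used

    def enqueue(self, body):
        msg_id = str(uuid.uuid4())
        record = {"id": msg_id, "status": "pending", "body": body}
        (self.directory / f"{msg_id}.json").write_text(json.dumps(record))
        return msg_id

    def dequeue(self):
        # Unwritten contract: assumes single-threaded access. Two workers
        # scanning concurrently can both claim the same pending message.
        for path in sorted(self.directory.glob("*.json")):
            record = json.loads(path.read_text())
            if record["status"] == "pending":
                record["status"] = "in_progress"
                path.write_text(json.dumps(record))
                return record
        return None

class MessageProcessor:
    """Sketch of processor.py: handler registry plus in-memory dedup."""
    def __init__(self):
        self.handlers = {}           # dispatch by message type
        self.processed_ids = set()   # unwritten contract: assumed persistent

    def process(self, record):
        if record["id"] in self.processed_ids:
            return False  # duplicate, skip
        self.processed_ids.add(record["id"])
        handler = self.handlers.get(record.get("type"), lambda r: None)
        handler(record)
        return True
```

A worker.py would then poll `queue.dequeue()` in a loop and hand each record to `processor.process()` — and nothing stops a second worker, or a restarted one, from doing the same thing at the same time.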
Read each file and the logic is clean. Nothing obviously wrong.
Fourteen bugs. Five within individual files, three at file interfaces, six at the system level. A reviewer who reads each file in isolation catches maybe five — the missing json import in processor.py, the re.match that should be re.search, the sanitizer that strips exact <script> tags but misses <SCRIPT>, <script >, <img onerror=...>, and everything else. These are per-file bugs. They’re visible, they’re catchable, they’re the kind of thing code review is built to find.
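Both of those per-file bug classes are easy to reproduce. The following is a hypothetical reconstruction, not the exercise’s actual code — the regex’s purpose and the function names are assumptions — but the failure behavior is exactly as described:

```python
import re

def sanitize_buggy(text):
    # Bug: str.replace removes only the exact lowercase token, so
    # <SCRIPT>, "<script >", and event-handler attributes pass through.
    return text.replace("<script>", "").replace("</script>", "")

def contains_url_buggy(text):
    # Bug: re.match anchors at the start of the string, so a URL
    # appearing mid-message is never flagged.
    return re.match(r"https?://", text) is not None

def contains_url_fixed(text):
    # re.search scans the whole string.
    return re.search(r"https?://", text) is not None
```

Sanitizing by string replacement is unwinnable in general — the fix is an HTML-aware escaper or an allowlist parser, not a longer blocklist — but even within this file, the bypasses are one reviewer-minute away.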
The nine that live between files require holding two or three mental models simultaneously. The processor’s in-memory dedup set and the queue’s on-disk persistence. The worker’s shutdown drain and the queue’s retry mechanism. The enqueue throughput and the dequeue polling rate.
Three unwritten contracts define the system’s actual behavior:
dequeue() assumes single-threaded access — no lock is acquired, despite a _lock_file attribute sitting right there in the constructor.

process() assumes the dedup set persists across worker lifetimes — it doesn’t; processed_ids = set() resets every time the worker restarts.

shutdown() assumes no concurrent workers will start during the drain — they might.

None of these contracts are enforced. None are documented. Each file is correct given its assumptions. The system breaks because the assumptions conflict.
In 2013, Alberto Bacchelli and Christian Bird at Microsoft Research conducted what remains one of the largest empirical studies of code review practices. They surveyed 165 managers and 873 programmers, interviewed 17 developers across 16 teams, and manually classified 570 code review comments from Microsoft’s CodeFlow tool.
The central finding was a mismatch between motivation and outcome. “Finding defects” was ranked the number-one reason developers do code reviews. But the most frequent actual outcome was code improvements — readability fixes, consistency tweaks, dead code removal. Twenty-nine percent of review comments addressed these quality-of-life concerns, not bugs. Subsequent analysis by Spadini and colleagues estimated that up to 75% of code review comments affect software maintainability rather than functionality.
The real benefits delivered by code review turned out to be knowledge transfer, team awareness, and education for new developers. Important benefits — just not the benefit everyone thought they were getting.
This maps precisely to the pipeline exercise. A reviewer looking at processor.py in isolation catches Bug 3 — the missing json import — immediately. It’s obvious. They might catch Bug 4 — re.match only matching at the start of the string — if they know the regex API well. But Bug 6, where the processor’s in-memory dedup set evaporates on restart while the queue’s on-disk state survives? That bug exists in the relationship between processor.py and worker.py. It’s invisible in either file alone.
Code review’s per-file structure systematically misses inter-file bugs. Reviewers focus on the code in front of them, not on the contracts between what they’re reading and what they haven’t opened.
Spadini and colleagues gave this class of bug an empirical name: “delocalized defects” — defects whose understanding requires holding multiple files’ logic in working memory simultaneously. Their research found a moderate association between catching delocalized defects and the reviewer’s working memory capacity. The association with other defect types was “almost non-existing.”
It’s not just that interface bugs are hard. They’re cognitively hard in a measurable way, and the measurement tracks with a specific cognitive resource — working memory — not with experience or expertise.
On June 4, 1996, the Ariane 5 rocket self-destructed 37 seconds after launch. The cause was an integer overflow in the Inertial Reference System: a 64-bit floating-point value was converted to a 16-bit signed integer, and the value exceeded the 16-bit range.
The IRS software had been reused from the Ariane 4. On the Ariane 4, the horizontal velocity during early flight stayed comfortably within 16-bit range. The Ariane 5 had a different trajectory with higher horizontal velocity. Engineers had deliberately disabled overflow protection for that variable to stay within an 80% CPU workload target, “based on assumptions which were correct for the trajectory of Ariane 4, but not Ariane 5,” according to the ESA inquiry board. The alignment function itself was only needed for roughly 40 seconds of flight — a requirement inherited from the Ariane 4 that served no purpose on the Ariane 5. Over $370 million in satellites destroyed by a function that shouldn’t have been running at all.
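The original fault was in Ada, but the narrowing conversion is easy to sketch in Python. The bias values below are invented stand-ins, chosen only so that one fits in 16 bits and the other does not; struct plays the role of the hardware conversion, and the exception plays the role of the Operand_Error that, unhandled, shut down the reference system in flight.

```python
import struct

def to_int16(value):
    """Narrow a float to a 16-bit signed integer, as the alignment
    conversion did. struct.error stands in for Ada's Operand_Error."""
    return struct.unpack("<h", struct.pack("<h", int(value)))[0]

ariane4_bias = 20_000.0  # invented value: inside the 16-bit range
ariane5_bias = 40_000.0  # invented value: exceeds 32767

to_int16(ariane4_bias)   # fine on the slow Ariane 4 trajectory

try:
    to_int16(ariane5_bias)
except struct.error:
    # The protection that would have caught this was deliberately
    # disabled; in flight the exception propagated and the IRS halted.
    pass
```

The check costs a few instructions. The decision to omit it was rational under Ariane 4 assumptions — which is the point: the conversion was never wrong, the assumption was.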
The inquiry board concluded that the failure was a “systems engineering” problem, not a software bug. The software did exactly what it was told to do. The specification was wrong for the new context.
This is the assumption-contract problem in its purest form, and the pipeline code has the same structure. The _lock_file attribute in MessageQueue.__init__ is the pipeline’s disabled overflow protection. The safety mechanism was designed — someone thought about locking, created the path object, stored it as an instance attribute — but never connected it to control flow. The artifact of good intentions exists in the code. The actual protection doesn’t.
dequeue() is correct if you assume single-threaded access. The Ariane 4 software was correct if you assumed a slow trajectory. The assumption, not the code, is the bug.
Three caveats before the practical bit.
First, scale changes the problem. A teaching exercise and a system managing $7.65 billion in exposure are different categories of risk. The bugs are structurally identical — processed_ids = set() and Power Peg’s disconnected reporting are the same persistence failure — but the organizational machinery needed to prevent them is not. Knight Capital needed deployment verification across ten servers, kill switches, position limits, and anomaly detection. The pipeline needs a persistent dedup store and a file lock.
Second, code review catches things this framing understates. The Bacchelli-Bird finding that reviews primarily produce code improvements rather than bug catches doesn’t mean reviews are low-value. Knowledge transfer and team awareness prevent future bugs by ensuring more people understand the system. The interface bugs in this pipeline might well be caught by a reviewer who worked on a similar queue last year — but that’s knowledge transfer doing the work, not the review process itself.
Third, the case studies here are all Western, large-organization, safety-critical systems. Interface bugs in a startup’s internal tool or a solo developer’s side project manifest differently and carry different costs.
The practical skill the pipeline teaches is a single question: what does this file assume about the file that calls it?
When you review processor.py, reading top-to-bottom catches the missing import and the regex error. Those are within-file bugs — the kind reviews are built to find. But asking what processor.py assumes about its caller exposes the dedup problem immediately: processed_ids = set() assumes the processor instance lives as long as the system needs dedup guarantees. Does it? Who creates this instance? What happens when they recreate it?
The question works because it forces you into the cognitive territory where delocalized defects live. You have to hold two mental models — what this file does, and what the other file expects — and compare them. It’s working-memory-intensive, which is exactly why it catches bugs that passive reading doesn’t.
Three concrete moves for making this work in practice:
Read the constructor, then read who constructs it. The MessageProcessor constructor initializes processed_ids as an empty set. The Worker constructor creates a new MessageProcessor every time it starts. Those two facts, held together, are the bug.
Find every attribute that stores state, and ask where that state goes when the process dies. processed_ids lives in memory. The queue lives on disk. State that survives restart and state that doesn’t will disagree after a crash. Every mismatch in persistence boundaries is a candidate for Bug 6.
Look for safety mechanisms that exist as data structures but aren’t used as control flow. The _lock_file is a path object in the constructor, never referenced again. The Ariane 5’s overflow protection was a software feature that was deliberately disabled. Both are the footprint of someone who saw the problem and didn’t — or couldn’t — finish the solution.
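As a closing sketch of what finishing the solution might look like, here is one hypothetical way to make the dedup state survive restarts and to wire the orphaned _lock_file into control flow. The class names and file layout are invented; the lock uses atomic exclusive creation of the lock file, which works across processes on any platform.

```python
import os
from pathlib import Path

class PersistentDedup:
    """Dedup set that survives restarts by appending seen IDs to a file."""
    def __init__(self, path):
        self.path = Path(path)
        self.ids = set(self.path.read_text().split()) if self.path.exists() else set()

    def seen(self, msg_id):
        """Return True if already processed; otherwise record it durably."""
        if msg_id in self.ids:
            return True
        self.ids.add(msg_id)
        with self.path.open("a") as f:
            f.write(msg_id + "\n")
        return False

class FileLock:
    """Minimal cross-process lock: O_CREAT|O_EXCL creation of the lock
    file fails with FileExistsError if another process holds it."""
    def __init__(self, path):
        self.path = str(path)

    def __enter__(self):
        self.fd = os.open(self.path, os.O_CREAT | os.O_EXCL)
        return self

    def __exit__(self, *exc):
        os.close(self.fd)
        os.unlink(self.path)
```

dequeue() would then wrap its scan in `with FileLock(self._lock_file):`, and the worker would construct one PersistentDedup pointed at a stable path instead of a fresh set(). (A production system would also want stale-lock recovery and fsync on the dedup log; this is the minimum that closes both contracts.)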
Knight Capital’s Power Peg is the production version of processed_ids = set(). Both are state that’s supposed to persist but doesn’t — Power Peg’s cumulative tracking was disconnected; the processor’s dedup set resets on restart. Both are invisible to single-file review. Both cause catastrophic duplication: re-sent emails in the pipeline, infinitely refreshing trades on the exchange floor.
The cost difference between them is pedagogical versus catastrophic. The structural difference is zero.
The bugs that cost $440 million, over $370 million, and human lives share a common address. They live in the assumptions between files that nobody wrote down, in the contracts between components that nobody enforced, in the 15% that slips past every method designed to catch bugs one file at a time. The code is not where the bug is. The code is where you go to confirm what you already suspected — after you asked what each file assumes about every other file it touches.
Sources: SEC Administrative Proceeding File No. 3-15570, In the Matter of Knight Capital Americas LLC; ESA Inquiry Board, “Ariane 501 — Presentation of Inquiry Board Report,” 1996; Leveson, N. and Turner, C.S., “An Investigation of the Therac-25 Accidents,” IEEE Computer, Vol. 26, No. 7, July 1993; Bacchelli, A. and Bird, C., “Expectations, Outcomes, and Challenges of Modern Code Review,” ICSE 2013; Capers Jones, “Software Defect Removal Efficiency,” PPI International, 2011 revision; Spadini, D. et al., “Advancing modern code review effectiveness through human error mechanisms,” Journal of Systems and Software, 2024.
The bugs that cost $440 million lived in contracts nobody wrote down. This writes them down.
Chain of Consciousness creates a cryptographic, tamper-evident provenance chain for every decision in your pipeline — what each component assumed, what it checked, what it decided. When the contracts between components are explicit and auditable, the 15% ceiling starts to crack. Not a testing tool. An assumption-documentation layer that makes the invisible visible.
pip install chain-of-consciousness
npm install chain-of-consciousness
See a live provenance chain →