A 0.5% contamination threshold rerouted 45% of global plastic waste overnight. Code review obeys the same math.
In January 2018, China set a number: 0.5 percent.
That was the maximum contamination rate it would accept in imported recyclables — the foreign matter, food residue, mixed plastics, and other non-target material allowed in any incoming bale. Half a percent. Down from a previous threshold near ten percent that, in practice, had been loosely enforced and routinely blown past.
The result of that single number, formalized as Operation National Sword, was immediate and global. China’s imports of waste plastic dropped 92 percent. Used paper imports fell 56 percent. Roughly 45 percent of the planet’s plastic-waste flows were rerouted overnight, scrambling toward Malaysia, Vietnam, Indonesia, Thailand, and Turkey — each of which subsequently set its own restrictions. The volume of waste hadn’t changed. The amount of actually recyclable material in any given bale hadn’t changed much either. What changed was the acceptable noise level.
That shift exposed something the recycling industry had quietly known for decades. The label “recyclable” had drifted from a property of materials to a property of attention. Items entered the recycling stream because they looked like the kind of thing that should be recyclable, or because the person dropping them in wanted them to be. The technical name for that, in the discard-studies literature, is wishcycling — placing waste in a recycling bin in the hope that it might be recycled.
Wishcycling sounds like a charming individual mistake. It isn't. It's a systems failure built out of thousands of charming individual mistakes that together contaminate the entire stream. A coffee cup with a plastic lining looks like paper but isn't. A greasy pizza box rots a paper bale. A loose battery, dropped in by someone who genuinely meant well, can ignite a sorting line; fires at Material Recovery Facilities cost anywhere from $2,600 to over $50 million each. A 2024 nationwide survey published in Waste Management clocked U.S. MRF contamination rates between 1 and 39 percent, averaging under 20 percent and rising with throughput. Past a threshold, the math inverts: the recyclable stream becomes more expensive to process than landfill.
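To see how a threshold flips the economics, run a toy model. Every dollar figure below is an assumption picked for illustration, not industry data; what matters is the shape, a crossover point past which processing a bale costs more than burying it.

```python
# Toy model of bale economics. All dollar figures are illustrative assumptions.
# Processing cost rises with contamination (more sorting, more residue to
# landfill anyway), while the recoverable material's value falls.
LANDFILL_COST_PER_TON = 60.0     # assumed tipping fee for skipping recycling
BASE_PROCESSING_COST = 80.0      # assumed cost to sort a clean ton
SORT_PENALTY_PER_POINT = 6.0     # assumed extra cost per percentage point
CLEAN_MATERIAL_VALUE = 150.0     # assumed resale value of a clean ton

def net_value_per_ton(contamination_pct: float) -> float:
    """Net value of processing one ton at a given contamination rate."""
    recoverable = CLEAN_MATERIAL_VALUE * (1 - contamination_pct / 100)
    cost = BASE_PROCESSING_COST + SORT_PENALTY_PER_POINT * contamination_pct
    return recoverable - cost

for pct in (0.5, 5, 10, 20, 30):
    net = net_value_per_ton(pct)
    # Recycling wins only while it beats paying the landfill fee outright.
    verdict = "recycle" if net > -LANDFILL_COST_PER_TON else "landfill wins"
    print(f"{pct:5.1f}% contamination: net ${net:8.2f}/ton -> {verdict}")
```

With these made-up numbers the inversion lands between 10 and 20 percent contamination; the real-world crossover depends on commodity prices and tipping fees, but the cliff is always there somewhere.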
Now consider a parallel.
A senior engineer on your team gets a pull request notification. They open it, scan the diff, see the author’s name — a colleague they trust, who has never shipped anything obviously broken — and type “LGTM.” Maybe with a thumbs-up. Maybe four minutes elapsed between notification and approval. Maybe none of that happened consciously; the engineer just wanted to clear the queue before standup.
Each instance is rationalizable. The author is competent. The change looks small. Tests pass. There are forty more PRs in the queue. Saying anything substantive would slow things down.
Each instance contaminates the signal.
In 2015, three Microsoft researchers — Amiangshu Bosu, Michaela Greiler, and Christian Bird — analyzed 1.5 million review comments from five Microsoft projects in what remains the most-cited empirical study of code-review quality. Their headline finding: more than one in three review comments were classified by the original recipients as not useful. Not “harsh,” not “wrong” — not useful. Comments that didn’t change anything, didn’t catch anything, didn’t teach anything. They also found something more uncomfortable: the proportion of useful comments climbs steeply during a reviewer’s first year and then plateaus. Reviewers don’t get better at reviewing — they get acclimated to it.
A complementary signal comes from the open-source corpus. Studies of GitHub merge histories find that roughly 46 percent of merged PRs received zero inline or top-level comments from any reviewer, and another 19 percent received only trivial single-token responses. Cross-cutting practitioner surveys from large engineering organizations consistently report that 35 to 40 percent of pull requests are approved without a single change requested. A widely circulated industry account of Fortune 500 review practices — admittedly a secondary practitioner source, and worth weighting as such — describes large-scale teams approving on the order of 100 PRs per week per reviewer, with the median review taking under three minutes. The number is anecdotal; the shape it describes is not.
The Chromium project, to its credit, called this out in public. A chromium-dev mailing list post titled Please don’t rubber stamp code reviews argued, with the bluntness of an actual production codebase, that “if code is too hard to understand, it indicates the code needs to be factored or commented better.” The team’s response wasn’t a checklist. It was structural: in late 2025, Chromium rolled out a 2P (two-person) review requirement for code authored by non-committers — a contamination-rate fix, not an inspection fix.
Plot the failure shape of code review against the failure shape of curbside recycling. The shapes rhyme.
| | Wishcycling (Waste) | Wishcycling (Code Review) |
|---|---|---|
| The act | Drop a contaminant in the recycling bin | Approve a PR without reading the diff |
| The motivation | “I want this to be recyclable” | “I trust this author / I’m slammed” |
| Individual logic | The single item seems harmless | The single LGTM seems harmless |
| Stream effect | Non-recyclables degrade the whole batch | Empty approvals degrade the meaning of all approvals |
| Threshold that breaks the system | >0.5% contamination rerouted ~45% of global plastic waste | A modest fraction of unread approvals collapses signal |
| Worst downstream effect | MRF economics invert; recyclables go to landfill | “Approved” becomes a lossy label everyone discounts |
The structural argument: wishcycling is a mistake about the relationship between the individual choice and the aggregate signal. A person with a paper coffee cup is not making a paper-versus-plastic decision; they are casting a vote about what counts as paper. A reviewer typing “LGTM” without reading is not approving one PR; they are casting a vote about what counts as approval. When enough votes go the wrong way, the label loses semantic content. “Recyclable” becomes a bin you point at on the way out the door. “Approved” becomes a button you click to turn the badge green.
Code-review wishcycling has always degraded production quality. What's new is that it now degrades the next generation of automated reviewers. In February 2025, Liu, Lin, and Thongtanunam at the University of Melbourne published Too Noisy To Learn: Enhancing Data Quality for Code Review Comment Generation (arXiv:2502.02757). They found that even after standard cleaning, only about 64 percent of comments in the most widely used code-review training set are valid. Roughly 36 percent are noise — vague, non-actionable, or off-topic. They then tried to filter the noise with LLM-based judges and got 66 to 85 percent precision, which sounds reasonable until you unpack it: precision only measures how often the filter is right when it does flag a comment, so 15 to 34 percent of what it discards is genuine signal, and whatever noise it fails to flag stays in the corpus. The result is a model that learns to generate review comments shaped like the comments in its training data — which, in practice, means a model that learns to produce plausible-looking LGTM-equivalents.
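A back-of-the-envelope makes the residue visible. The 64/36 split and the precision range come from the paper; the recall values below are assumptions, because how much noise actually leaves the corpus depends on recall, which precision alone doesn't tell you.

```python
# How clean is a corpus after dropping everything a noisy filter flags?
# The valid/noise split and precision range are from the paper; recall is assumed.
def residual_noise(valid=0.64, noise=0.36, precision=0.75, recall=0.7):
    """Noise fraction remaining after removing all flagged comments."""
    flagged = noise * recall / precision    # total volume the filter removes
    valid_lost = flagged * (1 - precision)  # good comments wrongly discarded
    noise_left = noise * (1 - recall)       # noise the filter never flagged
    valid_left = valid - valid_lost
    return noise_left / (noise_left + valid_left)

for prec in (0.66, 0.75, 0.85):
    for rec in (0.5, 0.7, 0.9):
        frac = residual_noise(precision=prec, recall=rec)
        print(f"precision {prec:.2f}, recall {rec:.2f} -> "
              f"{frac:.0%} of the surviving corpus is still noise")
```

Even the friendliest corner of that grid leaves a filtered training set that is several percent noise, bought at the price of throwing away real review signal.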
This is the recycling story exactly. A contaminated stream produces sorting algorithms trained on contaminated streams. The sorting algorithms then judge new streams. New streams get cleaner only at the rate the sorting algorithms can distinguish targets from contaminants — and they can’t, because they were trained on a corpus where targets and contaminants were already mixed.
Wishcycling has always carried a hidden cost: the appearance of doing the right thing absorbs the political and emotional energy that might otherwise have gone toward doing the right thing. The vast majority of Americans who report recycling regularly do not experience themselves as part of the problem; the fact that fewer than half can correctly identify what’s recyclable is, for them, a different story. The 2024 Waste Management survey found that gap embedded across every demographic. The developer equivalent is, if anything, more lopsided. Surveys of engineering teams routinely surface near-universal agreement that code review is critical — coupled with empirical findings that a sizable majority of reviews include nothing substantive. Self-reported quality behavior diverges sharply from observed quality behavior. The performance of the inspection has become the inspection.
A separate empirical thread sharpens this further. Jureczko’s 2020 study in IET Software on code-review effectiveness found that roughly 75 percent of issues raised in code review do not alter system behavior — they catch style, naming, convention, preference. That’s not a damning number on its own; clarity matters. But it does mean that the volume of substantive review is smaller than the volume of review comments suggests. Inside the third of comments that are even marked useful, the share that catches a behavior-changing problem is smaller still. The signal is thin even before the noise gets added.
It’s worth saying clearly: not every “LGTM” is wishcycling, and not every fast review is bad. A two-line bug fix on a well-tested module, by an author who ran the change locally and understands the surrounding code, doesn’t need a six-hour review. The Microsoft study found something instructive on this point: when the same reviewer reviews the same file for the fifth time, the proportion of useful comments rises to about 80 percent. Familiarity with code is signal. The trouble is that most rubber-stamping is driven by familiarity with the author — a different and weaker prior. A trusted author writing in unfamiliar code, reviewed by a familiar colleague, is exactly the configuration where wishcycling masquerades as efficient process.
It’s also fair to point out that some governance is real. Curbside recycling saved real material at real scale for decades; the National Sword shock was painful precisely because the prior system was partially working. Code review, even imperfect, catches things tests don’t — particularly issues of clarity, naming, and design that don’t surface as failures until much later. The argument here is not that quality processes are theater. It’s that contaminated quality processes are theater, and the contamination ratio is the variable that matters. A clean stream at 70 percent useful comments is doing real work. A stream at 30 percent is producing coverage statistics.
The third counter is the most important: heavier process on top of a contaminated stream is the wrong fix. This is the discard-studies insight that translates most cleanly. Adding more sorting symbols to packaging, more consumer-education campaigns, more required acknowledgments at the bin, produces more documentation of waste, not less waste. The fix is upstream — change the design of the packaging so the recyclability signal is unambiguous, or shift the cost of the wrong choice back onto the producer. Extended Producer Responsibility laws, now enacted in seven U.S. states since 2021, do exactly this: the producer who shipped the item bears the disposal cost.
The code-review equivalent is uncomfortable for the people who like checklists. Adding more required reviewers, more CI gates, more review templates, more automated linters on top of a culture of unread approvals produces more process artifacts and roughly the same quality. The fix is to change the contamination rate of the approval itself. Three concrete moves:
First, shift load to the author. A mandatory self-review pass, a change-set summary written in plain English (“here is what this changes and how I tested it”), explicit risk tagging, working tests for the new path — all of these move the cognitive work to the side of the system that already understands the change. The reviewer is no longer the sole quality gate; they are the second pair of eyes on a problem the author has already framed.
Second, stratify by risk. Google's published engineering practices push for fast review turnaround (one business day at the outside) on ordinary changes, while large or architectural changes warrant slower, deeper passes. The fast path is not an excuse for skim approvals; it's a recognition that not every diff carries the same risk. The stratification has to be visible in the workflow (labels, branches, conventions) so the reviewer knows which mode they're in before they open the diff; a rough version of that classifier is sketched after these three moves.
Third, make the social cost of an unread approval visible. The Chromium 2P rule does this structurally by requiring two persons to review a non-committer’s code; an unread approval has half the cover it used to have. A weaker version of the same idea: require approvers to leave at least one substantive comment, even if it’s a question. The point isn’t to slow things down. The point is that an approval should be the kind of thing you’d be willing to defend in a postmortem.
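To make the second move concrete, here is the shape of a risk-lane classifier. The path hints, size thresholds, and lane names are invented for illustration; every team's risky surface is different. The point is that the lane is computed and labeled before a human opens the diff, not left to the reviewer's mood.

```python
# Hypothetical risk-lane classifier; the thresholds and paths are assumptions.
from dataclasses import dataclass

RISKY_PATH_HINTS = ("auth/", "billing/", "migrations/", "crypto/")  # assumed

@dataclass
class Diff:
    files: list[str]
    lines_changed: int
    touches_public_api: bool

def review_lane(diff: Diff) -> str:
    """Return a lane label the workflow can surface as a PR label."""
    if diff.touches_public_api or any(
        f.startswith(RISKY_PATH_HINTS) for f in diff.files
    ):
        return "deep-review"   # architectural or sensitive: slow lane
    if diff.lines_changed <= 50 and len(diff.files) <= 3:
        return "fast-path"     # small and local: quick turnaround expected
    return "standard"

print(review_lane(Diff(["auth/session.py"], 12, False)))  # deep-review
print(review_lane(Diff(["docs/readme.md"], 8, False)))    # fast-path
```

Whether the labels come from a bot, a CI step, or a naming convention matters less than that they exist before review starts.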
Don’t ask whether your team’s code review is “good.” That’s the wrong question, in the same way “are you a good recycler?” is the wrong question. Ask what fraction of approvals on your team carry semantic content — a question, a non-trivial change request, evidence of having actually read the diff. If that fraction is low, no amount of process layered on top will fix it. The intervention is to lower the contamination rate, by making reads visible and by shifting work to the side of the system where it belongs.
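If you want a first estimate of that fraction, an afternoon with the GitHub REST API will produce one. This is a rough heuristic, not an audit: the repository name and the trivial-token list are assumptions to adapt, and an unauthenticated client will hit rate limits quickly, so pass a token for anything beyond a handful of PRs.

```python
# Rough measurement: what fraction of approvals carry semantic content?
# Heuristic: a non-trivial review body, or inline comments by the approver.
import requests

API = "https://api.github.com"
REPO = "your-org/your-repo"  # assumption: replace with a real repository
HEADERS = {"Accept": "application/vnd.github+json"}  # add "Authorization" for rate limits
TRIVIAL = {"", "lgtm", "+1", ":+1:", "ship it", "looks good", "looks good to me"}

def substantive_approvals(limit: int = 20) -> tuple[int, int]:
    pulls = requests.get(f"{API}/repos/{REPO}/pulls",
                         params={"state": "closed", "per_page": limit},
                         headers=HEADERS).json()
    total = substantive = 0
    for pr in pulls:
        n = pr["number"]
        reviews = requests.get(f"{API}/repos/{REPO}/pulls/{n}/reviews",
                               headers=HEADERS).json()
        inline = requests.get(f"{API}/repos/{REPO}/pulls/{n}/comments",
                              headers=HEADERS).json()
        commenters = {c["user"]["login"] for c in inline if c.get("user")}
        for review in reviews:
            if review["state"] != "APPROVED":
                continue
            total += 1
            body = (review.get("body") or "").strip().lower()
            if body not in TRIVIAL or review["user"]["login"] in commenters:
                substantive += 1
    return substantive, total

s, t = substantive_approvals()
print(f"{s}/{t} approvals carried semantic content" if t else "no approvals found")
```

The heuristic will miscount at the margins; it will still tell you whether you're looking at a 70 percent stream or a 30 percent one.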
China’s number was 0.5 percent. The relevant number for your codebase is whatever fraction of approvals you’d be willing to defend in front of a future incident. If that number is much lower than the fraction you’re shipping, the gap between them is your wishcycling rate. The signal has already started to drift. The good news is that the discard-studies literature has a hundred years of evidence on what works: not better inspection, but better upstream design and clearer cost allocation. The same fix is sitting on the table for software, waiting to be picked up.
The hopeful coffee cup in the recycling bin is not, in itself, a moral failure. It’s a signal-design failure. Same with the four-minute LGTM. The people doing it are mostly trying to be helpful inside a system that has stopped distinguishing helpful from harmless. Fix the system, and the people will follow. Don’t fix it, and the next generation of automated tools will learn from the contaminated stream, judge new streams against it, and quietly hand back the same thin signal — only this time it will arrive faster, with better formatting, and indistinguishable from the real thing.
That’s the part to flinch at. Not the LGTM itself, but the moment it becomes uneconomic to tell the difference.
Sources: Bosu, A., Greiler, M., and Bird, C., “Characteristics of Useful Code Reviews: An Empirical Study at Microsoft,” MSR 2015; Liu, C., Lin, Z., and Thongtanunam, P., “Too Noisy To Learn: Enhancing Data Quality for Code Review Comment Generation,” arXiv:2502.02757, 2025; Jureczko, M., “Significance of code review impacting software quality measured by defects discovered post-release,” IET Software, 2020; chromium-dev mailing list, “Please don’t rubber stamp code reviews,” 2024–2025; Waste Management, 2024 nationwide MRF contamination survey; U.S. EPA and discard-studies literature on Operation National Sword, 2018–2024; Google Engineering Practices documentation on code-review turnaround.
An approval should be the kind of thing you’d defend in a postmortem.
Wishcycling collapses when the cost of a wrong choice is allocated back to the source. Chain of Consciousness creates a cryptographic, tamper-evident provenance chain for every approval, comment, and change — so the difference between a read and an unread “LGTM” stops being invisible. When approvals carry signed evidence of what was actually inspected, the contamination rate becomes measurable. And what’s measurable can be fixed.
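For the general shape of the idea (a sketch of the mechanism, not Chain of Consciousness's actual API), a hash-chained event log is the minimal version: each record commits to the diff it covers and to the record before it, so rewriting history, or approving a diff that was never opened, breaks the chain.

```python
# Minimal tamper-evident approval log; illustrative, not the product's API.
import hashlib, json, time

def _digest(payload: dict) -> str:
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

def record_event(chain: list, actor: str, action: str, diff_text: str) -> dict:
    event = {
        "actor": actor,
        "action": action,  # e.g. "viewed", "commented", "approved"
        "diff_sha": hashlib.sha256(diff_text.encode()).hexdigest(),
        "ts": time.time(),
        "prev": chain[-1]["hash"] if chain else None,  # link to prior record
    }
    event["hash"] = _digest(event)  # hash computed over everything above
    chain.append(event)
    return event

def verify(chain: list) -> bool:
    prev = None
    for e in chain:
        body = {k: v for k, v in e.items() if k != "hash"}
        if e["prev"] != prev or e["hash"] != _digest(body):
            return False
        prev = e["hash"]
    return True

log: list = []
record_event(log, "alice", "viewed", "diff v1")
record_event(log, "alice", "approved", "diff v1")
print(verify(log))               # True
log[0]["action"] = "approved"    # tamper with history
print(verify(log))               # False: the first record's hash no longer matches
```

A real system would sign records with reviewer keys rather than merely hash them; the sketch only shows why tampering becomes visible.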
`pip install chain-of-consciousness` · `npm install chain-of-consciousness`
See a live provenance chain →