
The Skeptic

Nineteen bets is not a sample. But a direction without a sample is still a direction — and the direction matches forty years of dissent research, Tetlock’s confidence-accuracy inverse, and a 2025 result on multi-agent debate.

Published May 2026 · 9 min read

The first agent said the price would go up. It had reasons — good ones. Order-book buy pressure at 1.3. Spot price inside the bucket with 25 minutes remaining. Implied edge: 8.2 cents.

The second agent looked at the same data and said no.

Across nineteen prediction-market bets, when the first agent listened to the second and lowered its confidence after pushback, the bets went 4-1. Plus 14.9% return on investment. When the first agent ignored the second and bet anyway, the bets went 2-12. Minus 56.3%.

Nineteen bets is not a sample size. A binomial null on the positive cohort gives p = 0.19 — no statistician would let you call this a result. But the direction is unanimous, and a direction without a sample is still a direction.
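If you want to check that figure, the one-sided binomial test on the 4-1 cohort fits in a few lines of Python, stdlib only:

```python
from math import comb

# P(at least 4 wins in 5 bets) under a fair-coin null (p = 0.5).
p_value = sum(comb(5, k) for k in (4, 5)) / 2**5
print(f"p = {p_value:.2f}")  # p = 0.19
```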

The question that interests me is why this worked at all. Designating one agent to disagree should not produce better outcomes. The most influential research in organizational psychology says it doesn’t. And yet, in the small data, it did.


The Nemeth problem

For four decades, Charlan Nemeth at UC Berkeley has been studying what happens when groups try to manufacture disagreement on purpose. Her central finding upends the conventional wisdom about devil’s advocates: it doesn’t work.

Authentic dissent — when someone genuinely believes a different position — stimulates divergent thinking and produces a greater proportion of original thoughts (Nemeth, European Journal of Social Psychology, 2001).1 Even when the dissenter is factually wrong, she writes in In Defense of Troublemakers (2018), the challenge “actually liberates people” to reconsider.2 The group with a real dissenter generates more new ideas, considers more angles, and updates more often.

Designated dissent — someone role-playing disagreement because that’s their assigned job — does not produce these benefits. Instead, it stimulates cognitive bolstering of the initial position. Knowing the objection is performative, the group rehearses its existing arguments more vigorously rather than reconsidering them. The devil’s advocate, in Nemeth’s data, makes groups more confident in their original answer, not less.

This is a problem for the second agent. Its disagreement is, by construction, performative. Its job is to disagree. By Nemeth’s findings, this should fail.

There is a way out, and it is instructive.

The second agent isn’t performing disagreement for a human audience. It’s running a different analytical process on the same data. The “authenticity” Nemeth identifies — the thing that distinguishes useful dissent from useless dissent — comes, in human contexts, from genuinely held belief. In computational contexts, it can come from genuinely different computation. When the second agent re-checked the spot price and found it had moved $73 in six minutes, that wasn’t a performance. It was a different number.

Call this computational dissent: skepticism whose independence comes not from belief but from algorithmic diversity. Two agents with different priors, different windowing, different sources of staleness, will produce different outputs on the same input. The disagreement is not theatrical. It is a measurement.
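Here is a minimal sketch of that distinction in code. The agent logic, window lengths, and tick data are all hypothetical; the structural point is that the skeptic's objection is the output of a different computation on the same input, not a scripted counter-position:

```python
from statistics import mean

def primary_estimate(ticks: list[float]) -> float:
    """Primary agent: long lookback window, smooth prior."""
    return mean(ticks[-30:])

def skeptic_estimate(ticks: list[float]) -> float:
    """Skeptic: short window, weighted toward the freshest tick."""
    recent = ticks[-5:]
    return 0.5 * recent[-1] + 0.5 * mean(recent)

def dissent(ticks: list[float], threshold: float = 25.0) -> bool:
    """The disagreement is a measurement: the gap between two
    independent computations on the same input."""
    return abs(primary_estimate(ticks) - skeptic_estimate(ticks)) > threshold

# A fast move in the final ticks: the short window sees it,
# the long window mostly does not, so the skeptic objects.
ticks = [96_000.0] * 28 + [96_040.0, 96_073.0]
print(dissent(ticks))  # True
```

Nothing in the skeptic's code says "disagree." The "no" falls out of the arithmetic.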

This isn’t a fully resolved tension. Whether computational dissent generalizes — whether human teams using AI skeptics can extract the benefits Nemeth saw from authentic dissent — is genuinely open. My guess is that the closer the AI skeptic stays to running independent analysis, rather than role-playing pushback for legibility, the more squarely it lands on the productive side of Nemeth’s line.


Why confidence makes things worse

Philip Tetlock’s twenty-year study of expert prediction is the largest empirical record we have of what happens when smart people are confident about the future. From 1984 to 2003, he tracked 284 experts making 82,361 predictions about political and economic outcomes (Expert Political Judgment, Princeton University Press, 2005).3

The headline finding was that the average expert was, in his memorable phrase, “roughly as accurate as a dart-throwing chimpanzee.” Tetlock has since stressed this oversimplifies — as a group, experts beat chance, just not by much, and not better than attentive readers of The New York Times.

The deeper finding, the one that matters here: Tetlock found an inverse relationship between accuracy and self-confidence, renown, and depth of knowledge. There was a point of diminishing returns where increasing expertise made predictions worse. The hedgehogs — one big idea, high confidence, defended vigorously — performed worse than the foxes, who held many small ideas with lower confidence and a habit of self-criticism.

Look back at the small data. The first agent’s most confident predictions were its worst predictions. Every threshold increase — 5 cents of edge, 8 cents, 10 cents — produced worse returns, not better. The model’s confidence was inversely correlated with its accuracy. This is the Tetlock pattern at miniature scale.
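Running that check against a bet log is a one-function job. The record shape and numbers below are hypothetical stand-ins, since the essay's log is private, but the procedure is the same: filter by confidence threshold, compute realized return:

```python
# Hypothetical bet log: (implied_edge_cents, realized_return).
# Illustrative values only; the essay's real log is not public.
bets = [(3.1, 0.12), (4.0, 0.08), (5.4, -0.05),
        (6.7, -0.18), (8.2, -0.31), (10.6, -0.55)]

def roi_above(log, threshold):
    """Mean realized return over bets whose implied edge
    cleared the confidence threshold."""
    kept = [ret for edge, ret in log if edge >= threshold]
    return sum(kept) / len(kept) if kept else float("nan")

for t in (5, 8, 10):
    print(f"edge >= {t}c: ROI {roi_above(bets, t):+.1%}")
# If confidence tracked accuracy, ROI would rise with the
# threshold. Here, as in the essay's log, it falls.
```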

The miniature is consistent with the macro. The IARPA-funded Aggregative Contingent Estimation tournament (2011–2015) found that amateur “superforecasters” beat professional intelligence analysts — people with classified access — by roughly 30% (Tetlock & Gardner, Superforecasting, 2015).4 The strongest predictor of forecasting accuracy was not intelligence or domain depth. It was commitment to self-improvement and self-critical thinking. The amateurs beat the spies because the amateurs were more willing to lower their confidence when challenged.

The institutional version is just as ugly. Forecasters in the Survey of Professional Forecasters report an average of 53% confidence in their predictions and are correct 23% of the time (Moore et al., Collabra: Psychology, 2024).5 Their confidence intervals are systematically too narrow. They are sure about things they should not be sure about.

It is worth saying that calibration is possible — weather forecasters manage it. When a National Weather Service forecast says 70% chance of rain, it rains roughly 70% of the time (Murphy & Winkler, Journal of the American Statistical Association, 1984).6 The reason is brutal: weather forecasters get graded on every prediction within hours, and they get graded a lot. Most expert domains have feedback loops measured in years, if at all. The skeptic is what you build when you can’t get the feedback fast enough to calibrate honestly on your own.
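That kind of calibration is directly measurable whenever you keep a log of stated probabilities and outcomes. A sketch of the standard check, with the 10% binning scheme as an assumption:

```python
from collections import defaultdict

def calibration(forecasts):
    """Group (stated_probability, outcome) pairs into 10% bins and
    compare stated confidence with the observed hit rate."""
    bins = defaultdict(list)
    for prob, happened in forecasts:
        bins[round(prob, 1)].append(happened)
    return {p: sum(hits) / len(hits) for p, hits in sorted(bins.items())}

# A well-calibrated forecaster (the weather-service pattern):
# when they say 0.7, it rains about 70% of the time.
log = ([(0.7, True)] * 7 + [(0.7, False)] * 3
       + [(0.3, True)] * 3 + [(0.3, False)] * 7)
print(calibration(log))  # {0.3: 0.3, 0.7: 0.7}
```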


Institutional skeptics

After September 11, 2001, the CIA created a permanent group whose only job was to say “you’re wrong.” The Red Cell, established under Director George Tenet, runs alternative analysis on the agency’s own conclusions. Team A/Team B exercises argue opposing interpretations of the same intelligence. “What if” scenarios challenge consensus assumptions. Premortems imagine the plan has already failed and work backward to identify why (Zenko, Red Team, 2015).7

Gary Klein’s premortem technique is worth pausing on, because it describes a specific mechanism the second agent shares. Before executing a plan, the team imagines it has already failed and generates reasons for the failure (Klein, Harvard Business Review, 2007).8 The technique works, Klein argues, because it gives people permission to voice concerns they would otherwise suppress to maintain group cohesion. The premortem is a time-inverted skeptic — you challenge the outcome before the execution rather than the analysis before the bet. Same shape.

But institutional skepticism has a failure mode, and it’s important.

In 1976, President Gerald Ford commissioned Team B to challenge the CIA’s estimates of Soviet military capability. Team B, composed of hawkish outside experts, argued the CIA was systematically underestimating the Soviet threat. The retrospective verdict is severe. CIA Director George H.W. Bush concluded the Team B approach “lends itself to manipulation for purposes other than estimative accuracy.” Brookings scholar Raymond Garthoff later wrote that “virtually all of Team B’s criticisms proved to be wrong” — and always in the same direction, always toward enlarging the impression of danger.9

Team B was a designated devil’s advocate with a prior commitment. Its skepticism was ideological, not procedural. This is the Nemeth trap reappearing in institutional clothing: when the skeptic has an agenda, the team’s defense of its position hardens around the wrong axis. Team B didn’t make the CIA more careful. It made the CIA prepare to argue against threat-minimization, regardless of what the data said.

The lesson, repeated across these institutions: the skeptic must be computationally independent, not ideologically oppositional. Daniel Kahneman’s adversarial collaboration protocol formalizes this. Researchers with opposing views design experiments together, agree in advance what would constitute a fair test, and commit to accepting the outcome. Kahneman’s seven-year collaboration with Gary Klein on expert intuition is the canonical example — two researchers who disagreed strenuously, then converged on a richer answer than either started with: expert intuition works in regular environments with rapid feedback (firefighters, chess players) and fails in irregular environments with delayed feedback (stock pickers, clinical psychologists) (Kahneman & Klein, American Psychologist, 2009).10 Neither was fully right. The collaboration produced something neither could have produced alone.


The architecture has caught up

If the small story rings true, what does the large story say?

The RedDebate framework, published in 2025, ran a structured experiment on multi-agent debate as a safety mechanism for large language models. It tested three debate strategies on the HarmBench benchmark: a peer debater, a devil-angel pair, and a Socratic prober.11

The baseline error rate, with standard prompting, was 38.7%. After SReD (the Socratic strategy), it dropped to 21.0% — a 45.7% relative improvement. With memory integration, it dropped further, to 6.1% — an 84.3% reduction from baseline (RedDebate, arXiv:2506.11083).

The Socratic prober beat the peer debater and the devil-angel pair, and the gap is wide. The form of skepticism mattered more than its intensity. Asking better questions produced better answers than producing better counter-arguments.

The paper also notes a finding worth holding next to Nemeth: agents that initially gave safe responses sometimes produced unsafe content when challenged during debate. The challenge surfaced latent vulnerabilities that single-turn evaluation never saw. This is the AI version of the same mechanism the premortem exploits — structured doubt makes hidden failures visible.

The 84.3% reduction is the large-sample echo of the 19-bet story. Different scale, same shape.


What the small data was for

There is a meta-move in the original analysis that’s easy to miss, and it matters.

The framing names, plainly, that 19 bets is not a sample. p = 0.19. This wouldn’t clear any conventional bar. A different writer would have buried the small-n problem in selective reporting, or framed the directional finding as a result. Instead the analysis says it out loud, then argues from direction rather than significance.

That is exactly what Tetlock identifies as the marker of well-calibrated forecasting: the willingness to say “I don’t know how confident to be” when the data doesn’t support more. The honesty about the sample size is itself a form of skepticism, applied to the thesis. The structure of the argument enacts the argument.

This is not a rhetorical flourish. It is the only honest version. If the thesis is that the skeptic needs to be heard, not proven right, then a piece insisting it has been proven right would be a self-refutation.


Where this argument is weakest

Three concessions worth naming.

The Nemeth gap is not closed. The “computational dissent” move is a hypothesis. We have a small directional finding from one agent pair and a 2025 result from one debate framework. That is not a body of evidence. It is consistent with Nemeth’s findings if you interpret authenticity as algorithmic independence; it is also consistent with a fluke. The honest position is that designated AI skeptics might escape Nemeth’s trap, not that they do.

Skepticism scales badly. If every analysis is challenged, nothing executes. The essay does not argue for skepticism as decision-maker — it argues for skepticism as input. But in time-sensitive environments (trading windows, emergency response, breaking incidents), the cost of structured doubt may exceed its benefit. There is a real engineering trade-off here that the small data does not address.

Architectural bias is still bias. The argument that the second agent has no agenda, only a different process, is a structural claim. In practice, model choices, prompt construction, and training data create implicit priors that can function exactly like ideology. A skeptic that always pushes toward “more cautious” is doing the same thing Team B did, just with different defaults. The mitigation — running the skeptic on diverse models, with diverse data, with diverse prompts — is engineering work, not a property of the architecture.


What to take from this

A few practical things, mostly oriented to people designing review processes — code review, decision review, agent review.

Build in computational dissent on real decisions, not display dissent for show. A second analytical pass that uses different data freshness, different priors, or a different model is doing real work. A second agent prompted to “argue against” is doing theater. Nemeth’s research is forty years of evidence that the theater doesn’t help. The work does.

Treat confidence elevation under pushback as a warning, not a strength. The Tetlock pattern is robust. Confidence elevation in response to challenge is the signature of cognitive bolstering — the team rehearsing its existing answer harder rather than reconsidering it. If your skeptic raises an objection and the system responds by tightening confidence intervals, you have built Team B, not the Red Cell.

Ask, don’t argue. The RedDebate result is striking enough to internalize: the questioning agent outperformed the debating agent and the devil-advocate pair. If you are designing review processes, prefer the format that surfaces assumptions through questions over the format that produces counter-positions. “What would have to be true for this to be wrong?” is a more useful prompt than “make the case against.” The Socratic move is the cheap intervention with the largest measured effect.
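In prompt terms, the difference between the two formats is small to write down. These templates are hypothetical, not taken from the RedDebate paper, but they show the structural contrast:

```python
# Two review-prompt shapes. Hypothetical wording; the structural
# contrast (counter-position vs. question) is the point.

DEVILS_ADVOCATE = (
    "Here is the analysis: {analysis}\n"
    "Make the strongest case that it is wrong."
)

SOCRATIC_PROBE = (
    "Here is the analysis: {analysis}\n"
    "List the assumptions it depends on. For each one, ask: "
    "what would have to be true for this assumption to fail, "
    "and has that been checked against current data?"
)

print(SOCRATIC_PROBE.format(analysis="price will rise; edge 8.2c"))
```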

Keep the skeptic procedurally independent. Team B failed because the skeptics had a thesis. Kahneman-Klein worked because the disagreement was structured around a falsifiable prediction. If the skeptic comes pre-committed to a direction, you have an advocate, not a skeptic. The independence has to be architectural, not declared — different data sources, different models, different priors — not a flag in the prompt that says “be skeptical.”

When the data is small, say so. The most confident voices in any room are usually the worst-calibrated ones. The room needs at least one voice willing to say “this is suggestive, not proven.” Sometimes that voice should be yours. Sometimes it should be a process you’ve built so it doesn’t have to be yours.

The skeptic does not need to be right. The skeptic needs to be heard.


Sources

  1. Nemeth, C. J. (2001). “The liberating role of conflict in group creativity.” European Journal of Social Psychology.
  2. Nemeth, C. J. (2018). In Defense of Troublemakers: The Power of Dissent in Life and Business. Basic Books.
  3. Tetlock, P. E. (2005). Expert Political Judgment: How Good Is It? How Can We Know? Princeton University Press.
  4. Tetlock, P. E. & Gardner, D. (2015). Superforecasting: The Art and Science of Prediction. Crown.
  5. Moore, D. A. et al. (2024). “Confidence calibration in professional forecasters.” Collabra: Psychology.
  6. Murphy, A. H. & Winkler, R. L. (1984). “Probability forecasting in meteorology.” Journal of the American Statistical Association, 79(387), 489–500.
  7. Zenko, M. (2015). Red Team: How to Succeed by Thinking Like the Enemy. Basic Books.
  8. Klein, G. (2007). “Performing a Project Premortem.” Harvard Business Review, September.
  9. Garthoff, R. L. (1991). “Estimating Soviet Military Force Levels: Some Light from the Past.” International Security, 14(4).
  10. Kahneman, D. & Klein, G. (2009). “Conditions for intuitive expertise: A failure to disagree.” American Psychologist, 64(6), 515–526.
  11. RedDebate (2025). “Multi-Agent Debate as a Safety Mechanism for LLMs.” arXiv:2506.11083.

Source note: The 19-bet figures (4-1 / 2-12, +14.9% / -56.3%, p = 0.19) are from an internal prediction-market experiment. The numbers are real; the sample is acknowledged as too small to clear conventional significance bars. The argument rests on directional consistency with three independently evidenced bodies of work (Nemeth, Tetlock, RedDebate), not on the small-n result alone.

The skeptic must be heard. For that to mean anything, the dissent and the response have to be on the record.

A second agent saying “no” only matters if the system records that it said “no,” what it said in response, and whether the first agent updated. Without provenance, the skeptic’s objection is invisible the moment the trade clears. Chain of Consciousness is the audit trail for agent decisions — signed entries for each action, each challenge, each override, anchored so the record can’t be silently rewritten after the outcome arrives. You cannot build the Red Cell on top of memory that forgets.
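The mechanism underneath a claim like that is an append-only hash chain. The sketch below is a generic illustration of the idea in stdlib Python, not the chain-of-consciousness API; the key handling and entry shape are assumptions:

```python
import hashlib, hmac, json, time

KEY = b"hypothetical-signing-key"  # in practice, a managed secret

def append(log: list[dict], event: dict) -> None:
    """Append a signed entry whose digest covers the previous entry,
    so rewriting history invalidates everything after it."""
    prev = log[-1]["digest"] if log else "genesis"
    body = {"ts": time.time(), "prev": prev, "event": event}
    payload = json.dumps(body, sort_keys=True).encode()
    body["digest"] = hmac.new(KEY, payload, hashlib.sha256).hexdigest()
    log.append(body)

log: list[dict] = []
append(log, {"agent": "primary", "action": "bet", "edge_cents": 8.2})
append(log, {"agent": "skeptic", "action": "challenge",
             "reason": "spot moved $73 in 6 min"})
append(log, {"agent": "primary", "action": "override", "updated": False})
```

Because each digest covers the previous entry, quietly editing the skeptic's objection after the outcome arrives breaks every digest downstream of it.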

pip install chain-of-consciousness · npm install chain-of-consciousness
See Hosted Chain of Consciousness →