Adversarial Dialogue

A verdict is only as trustworthy as the adversary it survived. A claim no one argued against is an unopposed assertion wearing a verdict's clothes.

Published June 2026 · 9 min read

For almost four hundred years, the Catholic Church employed a professional cynic. The job had a name that has since escaped into everyday English: the advocatus diaboli, the Devil's Advocate. Pope Sixtus V formalized the office in 1587, and its sole purpose was beautifully, deliberately negative. When the Church considered making someone a saint, the Devil's Advocate's job was to argue against it: to take the most skeptical possible view of the candidate's character, to hunt for holes in the evidence, and above all to attack the supposed miracles, to insist they were coincidence, exaggeration, or fraud. He was paid to lose, most of the time. But he was paid to make the other side win the hard way.

Then, in 1983, Pope John Paul II effectively abolished the office, cutting it back to a much narrower role. The stated goal was reasonable and modern-sounding: make the canonization process simpler, faster, cheaper, more productive. Remove the bottleneck. Streamline.

It worked, in the way streamlining always works. In the 396 years that the Devil's Advocate stood in the way (1587 to 1983), the Church canonized roughly 300 saints. In the 22 years after he was fired (1983 to 2005), it canonized around 500 and beatified more than 1,300. John Paul II alone made about five times as many saints as all his twentieth-century predecessors combined.
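As a quick check on those figures, the per-year rates can be worked out directly. This is only arithmetic on the rough counts quoted above, not a precise tally, and the variable names are illustrative.

```python
# Rough counts quoted in the text above; treated as approximations.
before_saints, before_years = 300, 1983 - 1587   # 396 years with the office
after_saints, after_years = 500, 2005 - 1983     # 22 years after it

rate_before = before_saints / before_years       # canonizations per year
rate_after = after_saints / after_years
speedup = rate_after / rate_before

print(f"before: {rate_before:.2f}/yr, after: {rate_after:.1f}/yr, ratio: {speedup:.0f}x")
# prints "before: 0.76/yr, after: 22.7/yr, ratio: 30x"
```

On these numbers the annual rate rose about thirtyfold. The "five times" figure is the narrower comparison between John Paul II and his twentieth-century predecessors.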

You can read that as a triumph of efficiency or as a cautionary tale, and which one you choose reveals what you believe about how trust is made. Because here is the uncomfortable reading: the Devil's Advocate was never the obstacle to sainthood. He was the evidence for it. A canonization that had survived a skilled, motivated, competent attacker meant something precisely because it had survived him. Remove the attacker and you don't get more trustworthy saints faster. You get more saints, and a verdict that no longer means what it used to. Measured against his twentieth-century predecessors, the throughput went up fivefold; measured against the four-century average, it went up far more. The thing the throughput was measuring quietly changed underneath it.

This is the most important and most ignored principle in the design of any system meant to produce trustworthy output, software very much included: a verdict is only as trustworthy as the adversary it survived. A claim that no one argued against is not a validated claim. It is an unopposed assertion wearing a verdict's clothes.

A single voice confirms itself

The reason adversarial structure matters so much is that the alternative, a single voice evaluating its own conclusion, has a fatal, well-documented tendency: it confirms itself. Confirmation bias is not a personal failing you can discipline away; it is the default behavior of any process that looks for reasons it's right instead of reasons it's wrong. A lone reviewer reads a design and asks "does this seem fine?" and the answer, for almost any competent-looking design, is yes. The questions that actually find the flaw, "under what input does this break? whose incentive does this violate? what does this quietly assume?", are adversarial questions, and they do not get asked by a mind that is rooting for the conclusion.

The deep move in every dependable truth-finding system, then, is the same: you do not ask one mind to be its own critic. You structure opposition into the process so that someone, or something, is institutionally responsible for trying to make it fail. The Church understood this in 1587. We keep re-discovering it, domain by domain.

The courtroom: truth as a contest

The most familiar version is the adversarial legal system. In a common-law trial, the truth is not handed down by a single wise investigator. It is fought over: a prosecutor whose job is to secure a conviction and a defender whose job is to secure an acquittal, each motivated to dismantle the other's case, in front of a judge and jury who decide. The genius of the design is that it does not require either advocate to be fair. It requires each to be partisan, and trusts the collision of two committed partisans to surface what a single investigator, however well-meaning, would miss. You trust the verdict not despite the fight but because of it. A confession extracted with no defense present is suspect in exactly the way a canonization with no Devil's Advocate is suspect: nothing tested it.

Science: validated means survived an attack

Science encodes the same principle as its core epistemology, thanks largely to Karl Popper. A scientific theory is never confirmed: confirmations are cheap, and you can find supporting evidence for almost anything if that's what you're looking for. A theory earns its standing by surviving sincere attempts to falsify it. The theory of relativity mattered not because Einstein believed it but because it stuck its neck out: it predicted starlight would bend by a specific amount, an experiment could have killed it, and the experiment did not. The adversary in science is the falsification attempt, and a hypothesis that has not survived one is not "probably true." It is untested, which is a completely different thing that often looks identical from the inside. An idea everyone likes and no one has attacked is the scientific equivalent of a saint with no Devil's Advocate.

The machines learned it too

Here is where it turns concrete for anyone building with modern AI, because the most powerful machine-learning ideas of the last decade are, structurally, the Devil's Advocate rebuilt in code.

In 2014, Ian Goodfellow and colleagues introduced the Generative Adversarial Network. A GAN is two networks locked in a zero-sum game: a generator that tries to produce fake data convincing enough to pass as real, and a discriminator whose entire job is to catch the fakes. The generator never gets told what a good output looks like. It only gets an adversary trying to expose it, and it improves by repeatedly failing to fool a critic that is itself getting sharper. Take the discriminator away and the generator has nothing to push against and learns nothing. The adversary is not a check on the learning. The adversary is the learning signal.
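For reference, the zero-sum game is usually written as the minimax objective from the 2014 paper, where D(x) is the discriminator's estimated probability that x is real and G(z) is the generator's output for a noise sample z:

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_z(z)}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

The generator appears only inside the second term, and only through D. Its sole training signal is the critic's judgment.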

And in 2018, Geoffrey Irving, Paul Christiano, and Dario Amodei proposed "AI safety via debate", an idea that reads like the Vatican's 1587 office written as an algorithm. Two AI agents are given a question and argue opposite answers, taking turns, and a human (who may not be expert enough to settle the question directly) judges which agent argued more truthfully and usefully. The insight underneath it is the one that makes this whole essay practical: for many hard questions, the right answer is far easier to recognize than to generate. A non-expert can't write a correct proof, but watching two experts attack each other's reasoning, the non-expert can often see whose argument breaks. Adversarial dialogue is the mechanism by which a weak verifier can supervise a strong, unreliable expert, which is precisely the position you are in every time you ask a powerful model a question you couldn't answer yourself.

That last point is the one to carry out of here. You are the non-expert judge. The model is the expert you cannot fully check. A single model asserting an answer gives you no adversary, and therefore no verdict, only a confident-sounding generation you have to take on faith. The fix is not a better single answer. It is a second voice whose job is to attack the first.
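The debate setup described above can be sketched in a few lines. This is a toy illustration, not the protocol from the paper: `run_debate`, the two `argue_*` functions, and `arithmetic_judge` are hypothetical names, and the debaters are hard-coded stand-ins where real models would go. The judge here cannot factor a number, but it can check a multiplication someone else exhibits, which is the "easier to recognize than to generate" asymmetry in miniature.

```python
import re
from typing import Callable, List, Tuple

Transcript = List[Tuple[int, str]]               # (speaker index, argument)
Debater = Callable[[str, str, Transcript], str]  # (question, assigned answer, transcript) -> argument
Judge = Callable[[str, Transcript], int]         # returns index of the winning debater


def run_debate(question: str, answers: Tuple[str, str],
               debaters: Tuple[Debater, Debater], judge: Judge,
               rounds: int = 2) -> Tuple[str, Transcript]:
    """Alternate turns between two debaters, then let the judge pick a winner."""
    transcript: Transcript = []
    for _ in range(rounds):
        for i in (0, 1):
            transcript.append((i, debaters[i](question, answers[i], transcript)))
    return answers[judge(question, transcript)], transcript


# Toy stand-ins for the two models (assumptions for illustration only).
def argue_prime(question, answer, transcript):
    return "91 is prime: no one has shown a factor."

def argue_composite(question, answer, transcript):
    return "91 is composite: 7 * 13 = 91"

def arithmetic_judge(question, transcript):
    """Weak verifier: cannot factor numbers, but can check an exhibited product."""
    for speaker, argument in transcript:
        m = re.search(r"(\d+) \* (\d+) = (\d+)", argument)
        if m:
            a, b, n = map(int, m.groups())
            if a * b == n and 1 < a < n:
                return speaker   # a checkable counterexample wins
    return 0                     # nothing checkable: the first claim stands by default


answer, transcript = run_debate("Is 91 prime?", ("yes", "no"),
                                (argue_prime, argue_composite), arithmetic_judge)
print(answer)  # prints "no"
```

Note the default branch: with no opponent exhibiting a factor, this judge would have accepted "yes". The unopposed answer wins by default, not on merit.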

The lesson for anyone who ships

A single voice confirms itself, so a single-voice process produces output you cannot trust, no matter how good the single voice is. The defenses you actually rely on are all adversarial dialogues in disguise, and the dangerous moments are the ones where you quietly abolish the adversary to go faster, exactly as the 1983 reform did.

An LLM that agrees with you has no Devil's Advocate. The frictionless, agreeable answer is the canonization with the skeptic removed. Make a second model attack the first's answer, find the flaw, build the counterexample, and judge the survivor. Verification is easier than generation; use that.
A design no one red-teamed is an untested theory. Don't ask "does this look right?" (the self-confirming question). Assign someone, or something, to ask "how do I break this?" and reward them for succeeding.
A benchmark you tuned your system to pass is a trial where the prosecution works for the defendant. A metric with no adversary optimizing against it is the abolished office: throughput up, meaning down.
A code review that rubber-stamps is a Devil's Advocate who has been told to approve. The reviewer's job is not to bless the change; it is to find the bug. A review that never says no is not a check; it's the 1983 reform applied to your main branch.
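The "attack the answer, build the counterexample, judge the survivor" loop can be sketched with a claimed sorting function as the thing under attack. Everything here is an illustrative assumption: `attack`, `is_valid_sort`, and `plausible_sort` are hypothetical names, and the attacker is a plain random search where a second model or a fuzzer could stand in. Checking a proposed output is much cheaper than producing it, so the attacker only needs a verifier and a source of inputs.

```python
import random
from collections import Counter
from typing import Callable, List, Optional


def is_valid_sort(original: List[int], output: List[int]) -> bool:
    """Cheap check: output is ordered and has exactly the same elements."""
    ordered = all(a <= b for a, b in zip(output, output[1:]))
    return ordered and Counter(output) == Counter(original)


def attack(claimed_sort: Callable[[List[int]], List[int]],
           trials: int = 1000, seed: int = 0) -> Optional[List[int]]:
    """Search for an input that breaks the claim; return it, or None if the claim survives."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-5, 5) for _ in range(rng.randint(0, 6))]
        if not is_valid_sort(xs, claimed_sort(list(xs))):
            return xs
    return None


def plausible_sort(xs: List[int]) -> List[int]:
    return sorted(set(xs))   # looks fine on casual inspection, silently drops duplicates


counterexample = attack(plausible_sort)
print(counterexample is not None)   # the attacker found a list with a repeated value
print(attack(sorted))               # prints "None": the built-in survives this attacker
```

A `None` result means only that this particular attacker failed. How much that is worth depends on how hard the attacker was able to try.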

In every case the move is the same: when you want to trust an output, do not strengthen the single voice. Introduce a genuine adversary and let the output survive it.

The honest part: a fake adversary is worse than none

There are two reasons the Devil's Advocate keeps getting abolished, and both are real, so the principle needs its limits stated plainly.

First, adversarial dialogue is slower and more expensive: that is the entire point of it and the entire reason it gets cut. The fight takes time the streamlined process doesn't spend. Sometimes that trade is correct: not every decision deserves a trial. The discipline is to spend your adversaries where being wrong is expensive (the irreversible, the load-bearing, the thing many people will trust) and to accept faster, unopposed verdicts only where a mistake is cheap and recoverable. You are not obligated to red-team a button color. You are obligated to red-team the thing that, if wrong, you can't take back.

Second, and more insidious, a captured or incompetent adversary is worse than none, because it manufactures the appearance of a survived attack without the substance. A Devil's Advocate who is told the outcome in advance, a discriminator too weak to catch the generator's fakes (GANs are prone to mode collapse when the critic is outmatched: the generator finds one cheap trick and stops improving), a reviewer who approves to avoid conflict, a "debate" where one side is handed the losing position: each produces a verdict that looks tested and isn't. This is the failure mode to fear most, because it is invisible from the outside: the ceremony of opposition with none of the friction. A real adversary has to be genuinely motivated to win and genuinely competent enough to do it. If your skeptic cannot, in principle, kill the claim, your skeptic is set dressing.
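One way to test whether a skeptic can kill a claim at all is to plant claims you already know are false and see how many it objects to. The sketch below assumes a deliberately trivial setting (claims about sums), and `catch_rate`, `rubber_stamp`, and `checker` are hypothetical names, not part of any real review tool.

```python
from typing import Callable, Iterable, Tuple

Claim = Tuple[int, int, int]            # (a, b, claimed value of a + b)
Adversary = Callable[[Claim], bool]     # True means "I object: this claim is wrong"


def catch_rate(adversary: Adversary, known_bad: Iterable[Claim]) -> float:
    """Fraction of deliberately false claims the adversary objects to."""
    bad = list(known_bad)
    if not bad:
        raise ValueError("need at least one planted false claim")
    return sum(1 for claim in bad if adversary(claim)) / len(bad)


def rubber_stamp(claim: Claim) -> bool:
    return False                        # never objects: the ceremony of review only

def checker(claim: Claim) -> bool:
    a, b, total = claim
    return a + b != total               # objects exactly when the sum is wrong


planted = [(2, 2, 5), (10, 1, 12), (0, 0, 1)]   # all false by construction
print(catch_rate(rubber_stamp, planted))  # prints "0.0"
print(catch_rate(checker, planted))       # prints "1.0"
```

An adversary with a catch rate near zero on planted failures tells you nothing when it stays silent on a real claim.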

What to actually do

So here is the question to ask of any output you're about to trust, whether a model's answer, a design, a result, or a number on a dashboard: "What adversary tested this, and could it have won?"

If the answer is "none" (no second voice was tasked with attacking it, with finding the input that breaks it, the assumption that fails, the miracle that's really a coincidence), then you do not have a verdict. You have an assertion that nothing has yet contradicted, which feels identical to a validated claim right up until the moment reality plays the role of the adversary you declined to hire.

And if the answer is "yes, a real one, and it lost", then you have something the streamlined process can never give you, no matter how fast it runs: a conclusion that earned its standing the hard way, by surviving someone who was genuinely trying to take it down. That is what the Church bought for almost four hundred years with the Devil's Advocate, and what it stopped buying in 1983. Build the adversary back in. The friction is not the cost of the trust. The friction is the trust.

Sources:

The Devil's Advocate (advocatus diaboli): Britannica, "Devil's advocate"; Wikipedia, "Devil's advocate." Office formalized by Pope Sixtus V (1587), abolished by Pope John Paul II (1983); role: argue against canonization, attack the evidence and claimed miracles; post-1983 canonization acceleration (~300 in 1587-1983 vs ~500 canonized and ~1,300 beatified 1983-2005; John Paul II about 5x his 20th-century predecessors).

The adversarial (common-law) legal system: partisan prosecution and defense before a neutral judge and jury; standard legal-systems reference.

K. Popper, The Logic of Scientific Discovery (1934/1959): falsifiability; theories validated by surviving refutation, not by confirmation.

I. Goodfellow et al., Generative Adversarial Nets (2014): generator vs discriminator, a two-player zero-sum game; mode collapse when the critic is outmatched.

G. Irving, P. Christiano, D. Amodei, AI Safety via Debate (arXiv:1805.00899, 2018): two agents argue opposite answers, a weaker judge decides; "the right answer is easier to recognize than to generate"; adversarial supervision of unreliable experts.

"What adversary tested this, and could it have won?" Make the answer checkable.

The essay's question only protects you if a reviewer can actually see what tested an agent's output and whether the test was real. A bare answer carries no adversary and no verdict. Chain of Consciousness is the tamper-evident record of what an agent actually did to reach a result (the reasoning, the checks, the challenges it survived), so the verdict comes with its evidence attached instead of asking you to take a confident generation on faith. The friction is the trust; the record is how you prove the friction happened.
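To make "tamper-evident" concrete, here is a generic hash chain built with the Python standard library alone. It is not the chain-of-consciousness package's API, which this post does not describe; `append_entry`, `verify`, and the entry layout are assumptions made for illustration. Each entry's hash covers the previous entry's hash, so editing any earlier step invalidates every hash that follows.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64   # placeholder "previous hash" for the first entry


def _digest(prev: str, event: str) -> str:
    body = json.dumps({"prev": prev, "event": event}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_entry(chain: List[Dict[str, str]], event: str) -> None:
    """Append an event whose hash commits to the entire history before it."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"prev": prev, "event": event, "hash": _digest(prev, event)})


def verify(chain: List[Dict[str, str]]) -> bool:
    """Recompute every link; any edited or reordered entry breaks the chain."""
    prev = GENESIS
    for entry in chain:
        if entry["prev"] != prev or entry["hash"] != _digest(prev, entry["event"]):
            return False
        prev = entry["hash"]
    return True


log: List[Dict[str, str]] = []
append_entry(log, "proposed answer: 91 is prime")
append_entry(log, "challenge: 7 * 13 = 91")
append_entry(log, "verdict: answer rejected")
print(verify(log))                                  # prints "True"

log[1]["event"] = "challenge: none raised"          # rewrite history after the fact
print(verify(log))                                  # prints "False"
```

A chain like this shows that the recorded challenge was not edited afterwards. It does not show that the challenge was a competent one. That still has to be judged from the record's contents.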

pip install chain-of-consciousness · npm install chain-of-consciousness
