A teardown. The technique has a name, a real pedigree, and a precise failure mode you can derive from first principles and then watch happen in your logs. It is a safety mechanism that works when you're already safe and abandons you when you're not.
You wrap your model call in a loop. Sample it seven times, take the majority answer, ship it with a little more confidence than a single shot would earn. It feels like diligence: more votes, more certainty, the wisdom of crowds applied to a language model. And on most of your test cases it nudges the numbers up, so you leave it in.
Then you look at the one question that actually mattered, the hard one, the edge case, the thing you most needed the system to get right, and you find this: six of the seven samples came back with the same wrong answer. Your vote didn't catch the mistake. It ratified it. It took a confident error and handed it a quorum, and the majority-of-seven wrapper made you trust the wrong answer more than you would have trusted a single guess.
That technique has a name, self-consistency, introduced by Xuezhi Wang and colleagues at Google in 2022, and it is one of the most widely deployed reliability tricks in production LLM systems. It also has a precise, knowable failure mode, the kind you can derive from first principles and then watch happen in your logs. So let's put it on the bench and take it apart.
Self-consistency is simple and, in its place, genuinely good. Instead of greedily decoding one answer, you sample several different reasoning paths from the model at nonzero temperature, then take the majority final answer. The original paper, "Self-Consistency Improves Chain of Thought Reasoning in Language Models," now an ICLR 2023 result, reported real, replicated, double-digit accuracy jumps on arithmetic, commonsense, and symbolic reasoning benchmarks. This is not a strawman. The technique works.
The teardown is not "it's useless." It's that it works in a narrow regime and is applied far outside it, and the single most useful thing you can know about it is exactly where that regime ends. To find the boundary, we need three tools, each borrowed from a different century, and each pointing at the same missing part.
The math that "ask five times and vote" implicitly invokes is 240 years old. The Marquis de Condorcet proved in 1785 that if you have a panel of voters who are (a) statistically independent and (b) individually competent, each more likely than not to be right, p > ½, then the accuracy of their majority vote climbs toward certainty as you add voters. More votes, more truth. That theorem is the dream sitting underneath the loop you wrote.
But it has two premises, and LLM sampling violates both, and it violates them worst on exactly the questions you care about.
Start with independence. Condorcet's jury are independent minds. Seven samples from one model, behind one prompt, are nothing of the sort: they are maximally correlated voters, all drawing from the same well, the model's own probability distribution. The political-science literature on correlated voting is blunt about what this does: without independence "Condorcet's theorem is no longer true," the effectiveness of majority rule "decreases as the correlation between votes increases," and when voters share an information source or follow an opinion leader, the positive correlation can erode or even reverse the Condorcet effect. Your seven samples don't follow an opinion leader. They are one, sampled seven times.
Now the second premise, competence. On an easy question, the model is right more often than chance, p > ½, and the wrong samples are scattered noise that the majority drowns out. Voting helps. But on a hard question, the kind the model gets systematically wrong, its per-sample accuracy drops below one-half. And below one-half, Condorcet's theorem doesn't just stop helping: it runs in reverse. With p < ½, adding voters drives the majority's accuracy toward zero. The same math that guarantees truth for a competent jury guarantees confident falsehood for an incompetent one. People cite Condorcet's promise while quietly violating his premise, and the violation bites hardest precisely where they needed the promise most.
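To watch the theorem run in both directions, here is a minimal sketch that computes the exact majority-vote accuracy for N independent voters on a binary question. The function name `majority_accuracy` and the example values of p (0.6 and 0.4) are illustrative assumptions, and the independence the calculation assumes is exactly the premise that samples from one model do not satisfy.

```python
from math import comb


def majority_accuracy(p: float, n: int) -> float:
    """Probability that a strict majority of n voters is right.

    Assumes a binary question, an odd n (so there are no ties), and voters
    that are each correct independently with the same probability p.
    """
    if n % 2 == 0:
        raise ValueError("use an odd number of voters to avoid ties")
    needed = n // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(needed, n + 1)
    )


if __name__ == "__main__":
    # A competent jury (p > 1/2) climbs toward 1; an incompetent one
    # (p < 1/2) sinks toward 0 as voters are added.
    for p in (0.6, 0.4):
        row = ", ".join(
            f"N={n}: {majority_accuracy(p, n):.3f}" for n in (1, 7, 51, 201)
        )
        print(f"p={p}: {row}")
```

The two rows are mirror images: whatever the vote adds to a jury at p = 0.6, it subtracts from a jury at p = 0.4.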
Make it exact. The expected error of an averaged or voted ensemble decomposes into three terms:
error ≈ bias² + (1/N)·variance + (1 − 1/N)·covariance
where covariance measures how correlated the members are with each other. Now watch, carefully, what voting is actually able to move.
Averaging N predictions divides the variance term by N. As you add samples, that term marches toward zero. Good. But it does nothing to the bias² term, the model's systematic, baked-in error, the part that's wrong the same way every time. And it does nothing to the covariance term unless you decorrelate the members. For N samples of a single model, the variance term, random decoding jitter, is usually the small part of the error, and it's the only part that dies. The bias² survives completely untouched. And the covariance term stays large, because the samples are near-copies of one another.
So voting removes the cheap noise and leaves the expensive error sitting exactly where it was. That is "barely works," compressed into one line of algebra. It is also why random forests go to such deliberate trouble to randomize their features at every split: the whole point is to force the covariance term down, because that's the only knob that actually moves the error. "Ask five times" decorrelates nothing. It is bagging over one model's dice rolls, and bagging, as every textbook says, reduces variance and never touches bias.
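The decomposition can be checked numerically. The sketch below is a toy model, not a measurement of any real LLM: it assumes numeric predictions that are averaged (not voted on), a true value of 0, and ensemble members that share one common noise component so that any pair has correlation `rho`. The names `ensemble_mse` and `simulated_mse` and all parameter values are illustrative assumptions.

```python
import random


def ensemble_mse(bias: float, variance: float, rho: float, n: int) -> float:
    """Closed form: bias^2 + variance/n + (1 - 1/n) * covariance."""
    covariance = rho * variance  # pairwise covariance between members
    return bias**2 + variance / n + (1 - 1 / n) * covariance


def simulated_mse(bias, variance, rho, n, trials=20_000, seed=0):
    """Monte Carlo estimate of the same quantity (true value is 0)."""
    rng = random.Random(seed)
    sigma = variance**0.5
    total = 0.0
    for _ in range(trials):
        shared = rng.gauss(0, 1)  # noise every member has in common
        members = [
            bias + sigma * (rho**0.5 * shared + (1 - rho) ** 0.5 * rng.gauss(0, 1))
            for _ in range(n)
        ]
        total += (sum(members) / n) ** 2  # squared error of the average
    return total / trials


if __name__ == "__main__":
    for rho in (0.9, 0.1):
        for n in (1, 7, 100):
            print(f"rho={rho} N={n:3d} mse={ensemble_mse(1.0, 1.0, rho, n):.3f}")
    print("simulated, rho=0.9, N=7:", round(simulated_mse(1.0, 1.0, 0.9, 7), 3))
```

With highly correlated members the error barely moves as N grows. With weakly correlated members it falls toward the bias² floor, and no value of N takes it below that floor.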
The folk justification for voting is the wisdom of crowds: Francis Galton's famous 1907 observation that the average of 787 fair-goers' guesses at an ox's weight came within a pound of the truth, beating the cattle experts. But re-read why it worked. The guesses were independent, and their errors pointed in every direction and cancelled. The modern statement, the "diversity prediction theorem," makes the dependence explicit:
crowd error = average individual error − diversity
Diversity is a literal term you subtract. Drive it to zero and the crowd is exactly as wrong as its average member, no better. And N samples of one model is a zero-diversity crowd. It is not a thousand villagers guessing independently; it is one villager, asked to guess five times, writing down a slightly different number each round because his hand shook. There is no wisdom in that. It's a herd of one.
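The identity is exact for squared error, and a few lines are enough to check it. In the sketch below, the function name `diversity_decomposition`, the guesses, and the true weight are all made-up illustrative values, not Galton's data.

```python
from statistics import fmean


def diversity_decomposition(guesses, truth):
    """Return (crowd_error, avg_individual_error, diversity).

    All three are squared-error quantities, and they satisfy
    crowd_error == avg_individual_error - diversity.
    """
    crowd = fmean(guesses)  # the crowd's answer is the mean guess
    crowd_error = (crowd - truth) ** 2
    avg_individual_error = fmean((g - truth) ** 2 for g in guesses)
    diversity = fmean((g - crowd) ** 2 for g in guesses)
    return crowd_error, avg_individual_error, diversity


if __name__ == "__main__":
    truth = 1190  # hypothetical true weight
    crowds = {
        "diverse crowd": [1000, 1100, 1300, 1400],
        "herd of one": [1290, 1300, 1300, 1310],
    }
    for name, guesses in crowds.items():
        crowd_err, avg_err, div = diversity_decomposition(guesses, truth)
        print(f"{name}: crowd={crowd_err:.0f} avg={avg_err:.0f} diversity={div:.0f}")
```

In the diverse crowd, large individual errors are almost entirely cancelled by the diversity term. In the herd of one, there is almost nothing to subtract.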
Three lenses, one prediction: sampling a single model many times reduces noise, not error, and on hard questions it amplifies the model's own bias. The 2024 to 2025 LLM literature says exactly that. One 2025 analysis finds that sampling multiple paths from a single model yields "correlated but systematically flawed paths," and that the errors self-consistency amplifies are "systemic and correlated, originating from the model's fundamental bias rather than the aggregation process." Gains plateau fast because new samples "overlap prior reasoning paths": you are re-rolling the same distribution, not gathering new evidence.
The intuition is worth making concrete, because it's the whole reason "more" stops helping. Suppose, on some question, the model lands on the correct reasoning path 40% of the time and on one particular wrong path 45% of the time, with the rest scattered. After a handful of samples you have already seen both the 40% answer and the 45% answer; every additional sample just re-confirms that same split. You are not discovering new candidate answers: the model's distribution only has a few, and you exhausted them early. Worse, the plurality of that split is the wrong answer, so the vote returns it, and sampling a hundred more times only makes the 45-beats-40 verdict more stable. The compute climbs without bound; the information stops arriving almost immediately.
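A small exact calculation shows how the vote behaves on a split like this. The sketch takes the two shares as parameters; the example run assumes the wrong path holds 45% of samples and the correct path 40%, and it treats the remaining samples as scattered across so many distinct answers that none of them can win. The name `plurality_outcome` is an illustrative assumption.

```python
from math import comb


def plurality_outcome(n: int, p_correct: float, p_wrong: float):
    """Exact chance that each of two competing answers wins a vote of n.

    Returns (correct_wins, wrong_wins). Ties between the two count for
    neither. Leftover probability mass is assumed too scattered to win.
    """
    p_other = 1.0 - p_correct - p_wrong
    correct_wins = wrong_wins = 0.0
    for c in range(n + 1):  # samples landing on the correct path
        for w in range(n - c + 1):  # samples landing on the wrong path
            other = n - c - w
            prob = (
                comb(n, c) * comb(n - c, w)
                * p_correct**c * p_wrong**w * p_other**other
            )
            if c > w:
                correct_wins += prob
            elif w > c:
                wrong_wins += prob
    return correct_wins, wrong_wins


if __name__ == "__main__":
    for n in (7, 51, 201):
        right, wrong = plurality_outcome(n, p_correct=0.40, p_wrong=0.45)
        print(f"N={n:3d}  correct wins {right:.2f}  wrong wins {wrong:.2f}")
```

Whichever path holds the larger share wins more reliably as N grows. Extra samples stabilize the verdict the distribution already contained; they do not change which answer that is.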
And then the finding that should retire the "more samples is always safer" reflex: the returns are not merely diminishing but, in places, negative. A 2024 TACL paper, "Self-Consistency Falls Short!", reports performance declining at high sample counts, and shows that on long-context tasks more sampling amplifies the model's positional bias, making it fixate ever harder on "positionally favored but incorrect documents." More voting made it more wrong, more confidently.
The signature to watch for in your own logs has a name in that work, the "overconfidence sample," where six of seven responses come back identically incorrect. That is not a confidence signal. It is a correlated-failure alarm. It is a model echoing itself and a vote-counter mistaking the echo for a jury.
Put it together and you get the uncomfortable summary that organizes the whole teardown. "Ask N times and vote" helps most on the questions where the model is already right on average and its mistakes are random jitter, the easy ones, where you barely needed the help. And it fails, or actively backfires, on the questions where the model is confidently, systematically wrong, the hard ones, where you needed help most, and where voting simply hands the wrong answer a quorum.
It is a safety mechanism that works when you're already safe and abandons you precisely when you're not. That's not a minor caveat. For a reliability technique, it's the whole ballgame.
Every lens points at the same absent component, so the fix is not subtle: independence. You cannot vote your way past a model's competence ceiling; to exceed what the model already knows, you have to inject a signal the model did not already contain. Two real moves, in increasing strength.
Decorrelate the voters. If you're going to spend N× the compute, don't spend it on N copies of the same prompt at the same temperature. Vary the prompt, the framing, the temperature, ideally the model. Diverse prompts "produce less-correlated votes that aggregate more stably," which is, once again, the random-forest move: drive the covariance term down. The catch is that the diversity has to be real, genuinely different angles of attack, not cosmetic. The same prompt with a fresh random seed buys you almost nothing, because it doesn't move the covariance term at all.
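One way to spend the same budget on varied voters, sketched under loud assumptions: `generate` is a hypothetical caller-supplied function wrapping whatever model client you use, and the model names, prompt templates, and temperatures are placeholders. Nothing here is a real library's API.

```python
from collections import Counter
from itertools import product
from typing import Callable, Sequence, Tuple

# Hypothetical signature: (model_name, prompt, temperature) -> final answer.
Generate = Callable[[str, str, float], str]


def diverse_vote(
    question: str,
    generate: Generate,
    models: Sequence[str],
    templates: Sequence[str],
    temperatures: Sequence[float],
) -> Tuple[str, float]:
    """Vote across every model x template x temperature combination.

    Returns the most common answer and the share of votes it received.
    """
    votes = [
        generate(model, template.format(question=question), temp)
        for model, template, temp in product(models, templates, temperatures)
    ]
    answer, count = Counter(votes).most_common(1)[0]
    return answer, count / len(votes)


if __name__ == "__main__":
    # Stand-in for real model calls, for demonstration only.
    def fake_generate(model: str, prompt: str, temperature: float) -> str:
        return {"model-a": "391", "model-b": "391", "model-c": "381"}[model]

    templates = [
        "Solve step by step: {question}",
        "Estimate first, then compute exactly: {question}",
    ]
    print(diverse_vote("17 * 23", fake_generate,
                       ["model-a", "model-b", "model-c"], templates, [0.3, 0.9]))
```

The code only guarantees that the configurations differ. Whether the resulting votes are actually less correlated is an empirical question you would have to check against your own failures.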
Verify, don't just vote. Instead of counting how many times the model agreed with itself, rank the N candidates with an independent verifier: a reward model, a unit test, a calculator or other tool, ground truth where you have it, or a different model. The load-bearing word is independent: a verifier that does not see the prior decisions provides a genuinely external reference and, in the literature's phrase, "mitigates confirmation bias," breaking the "trust the model's own majority" loop. And the payoff shows up exactly where voting failed: list-wise verification has been measured beating majority vote by 2 to 3 points, "most pronounced on challenging tasks." Even a cheap half-step helps: weighting each sample by the model's own logit-confidence (self-certainty) beats naive one-vote-each majority for free.
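A minimal sketch of the difference between counting votes and checking them, on a toy arithmetic question where a calculator can serve as the independent verifier. The function names, the sample answers, and the confidence scores are all hypothetical illustrations, not output from any real model.

```python
from collections import Counter, defaultdict
from typing import Callable, Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Return the most frequent answer (self-consistency)."""
    return Counter(answers).most_common(1)[0][0]


def best_by_verifier(answers: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the distinct answer that an external verifier scores highest."""
    distinct = list(dict.fromkeys(answers))  # dedupe, keep first-seen order
    return max(distinct, key=score)


def confidence_weighted_vote(answers: Sequence[str], confidences: Sequence[float]) -> str:
    """Weight each sample by a per-sample confidence instead of one vote each."""
    weights = defaultdict(float)
    for answer, confidence in zip(answers, confidences):
        weights[answer] += confidence
    return max(weights, key=weights.get)


def calculator_score(answer: str) -> float:
    """Toy verifier for the question 17 * 23: recompute and compare."""
    try:
        return 1.0 if int(answer) == 17 * 23 else 0.0
    except ValueError:
        return 0.0


if __name__ == "__main__":
    samples = ["381"] * 6 + ["391"]  # six identical wrong answers, one right
    print("majority vote:", majority_vote(samples))
    print("verifier pick:", best_by_verifier(samples, calculator_score))
    print("weighted vote:", confidence_weighted_vote(
        ["381"] * 4 + ["391"] * 3, [0.3] * 4 + [0.9] * 3))
```

The verifier never looks at how many samples agreed, which is the point: one correct candidate is enough if something outside the model can recognize it.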
One honest caveat, so this doesn't curdle into "just add a verifier and you're safe": a verifier is its own fallible signal. Optimize best-of-N hard enough against an imperfect verifier and the policy learns to game it, Goodhart's law, the same reward-hacking that haunts every system that optimizes a proxy instead of the true goal. You are not buying certainty. You are trading a maximally correlated signal for a less correlated, independently sourced one, which is strictly better, and still not free.
Carry one sentence out of this teardown: voting sharpens the model's distribution; it cannot extend it. Resampling makes the answer the model already believes show up more crisply, wonderful when that answer is right, useless or harmful when it's wrong. So before you wrap a call in a vote-loop:
First, diagnose your failures as variance or bias. Look at the wrong answers: are they scattered (different each run, that's variance, and voting will genuinely help) or clustered (the same wrong answer, run after run, that's bias, and voting will only make it louder)? You can read this straight off your logs, and it tells you in thirty seconds whether the loop you're about to write is worth anything.
Second, if you're going to spend N× the compute, spend it on independence, not repetition: N different prompts or models, or one good external verifier, never N rolls of the same die.
Third, treat "six of seven identically wrong" not as reassurance but as an alarm. A crowd is only wise when it is actually a crowd. Five samples of one model behind one prompt is not a jury, not a crowd, not a panel of experts. It is one die, thrown five times, shouting its single number a little louder each throw.
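The first check, reading failures as scattered or clustered, can be sketched as a few lines over logged wrong answers. The function name `diagnose_failures`, the 0.6 threshold, and the example logs are illustrative assumptions; pick a threshold that fits your own data.

```python
from collections import Counter
from typing import Sequence


def diagnose_failures(wrong_answers: Sequence[str], cluster_threshold: float = 0.6) -> str:
    """Classify one question's logged wrong answers across repeated runs.

    If a single wrong answer accounts for at least cluster_threshold of the
    failures, call it bias (voting will amplify it). Otherwise call it
    variance (voting can average it out).
    """
    if not wrong_answers:
        return "no failures"
    top_count = Counter(wrong_answers).most_common(1)[0][1]
    top_share = top_count / len(wrong_answers)
    return "bias (clustered)" if top_share >= cluster_threshold else "variance (scattered)"


if __name__ == "__main__":
    # Hypothetical log excerpts: wrong final answers for two questions.
    print(diagnose_failures(["381", "381", "381", "381", "381", "381"]))
    print(diagnose_failures(["381", "401", "389", "371", "393"]))
```

A result of "bias (clustered)" is the six-of-seven alarm in log form: more samples of the same prompt will not fix it.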
The trick was never "more samples." It was "more independence," and the cheapest way to fool yourself in this entire field is to buy the first while believing you bought the second.
"Verify, don't just vote" only protects you if the verification is real and recorded.
The whole repair is independence: rank candidates with a verifier that did not see the prior decisions, and the cheapest self-deception is buying more samples while believing you bought more independence. That only stays honest if a reviewer can see what actually happened. Chain of Consciousness is the tamper-evident record of what an agent did to reach a result: the candidates it generated, the verifier it ran, the check that did or didn't fire. It turns "we used self-consistency" into a record a reviewer can audit for whether real independent signal was injected, instead of a majority echo wearing a verdict's clothes.
pip install chain-of-consciousness · npm install chain-of-consciousness