Why “review after the first comment” destroys your crowd. A team estimate is the average of independent judgments; the moment the first one is visible, you have an echo with a quorum.
Watch what happens in the next estimation meeting you sit in. The team is about to size a piece of work, and before the cards come out (or before anyone has really thought about it) the staff engineer says, almost offhand, “this is a two-day task, tops.” Now watch the numbers. The person who privately suspected it was a week shaves it to three days. The one who thought two feels vindicated and says so loudly. The junior who has never touched this part of the system, and who might have asked the one question that revealed the hidden migration, says nothing and writes down “2.” Forty minutes later you have a team estimate of two days. It is wrong, and it is wrong in a specific, diagnosable way: it isn't the average of five judgments. It's one judgment with four co-signers.
Solomon Asch could have told you this would happen, and he could have told you in 1951. In his now-famous line-judgment experiments, a subject sat in a room with several other people (all secretly confederates) and was asked to say which of three lines matched a reference line. The answer was obvious; the lines weren't close. But the confederates, one after another, confidently gave the same wrong answer. And about three-quarters of real subjects went along with the obviously wrong group answer at least once. Across the critical trials, people conformed to the wrong answer roughly a third of the time. Only about a quarter of subjects held the line every single time. The detail that should haunt every team lead is this: in the interviews afterward, many of the conformers knew the group was wrong. They saw the correct line as plainly as you'd see it. They went along anyway, to avoid friction, to not be the odd one out, to not make it weird.
If people will abandon the evidence of their own eyes about the length of a line, to avoid mild social awkwardness, in front of strangers they will never see again, then your junior engineer is absolutely going to defer to the staff engineer's confident first number about a system the staff engineer designed. You are not running an estimation meeting. You are running an Asch experiment, on purpose, every sprint.
The wisdom-of-crowds idea, popularized by James Surowiecki's 2004 book, rests on four conditions: diversity of opinion, independence (each person's judgment isn't determined by those around them), decentralization, and a good way to aggregate the judgments. When all four hold, something genuinely surprising happens: the group's combined judgment tends to beat even its smartest individual member. The canonical demonstration is Francis Galton's, from a country fair in Plymouth in 1906. Some 787 people paid to guess the weight of an ox. Galton, who expected to prove the public foolish, collected the tickets and found that the median guess was 1,207 pounds. The ox weighed 1,198. The crowd's median was off by nine pounds, less than one percent, better than the cattle experts in the room.
Here is the part everyone quotes and almost no one operationalizes: why does the averaging work? It works because the errors are uncorrelated. Some people guessed too high, some too low, for a thousand idiosyncratic reasons, and when you pool them the overshoots and undershoots cancel and what's left standing is the signal. The crowd is smart not because everyone is right but because everyone is wrong in different directions. That cancellation is the entire mechanism. It is also extraordinarily fragile, and independence is the condition that protects it.
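The cancellation can be made precise with a standard result about averaging noisy estimates. As a sketch, assume each of n guesses equals the true value plus a zero-mean error with variance σ², and that any two errors share the same correlation ρ. These symbols are introduced here for illustration; they are not taken from Galton or Surowiecki.

```latex
\[
x_i = \theta + \varepsilon_i, \qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
\operatorname{Var}(\bar{x}) = \frac{\sigma^2}{n}\bigl(1 + (n-1)\rho\bigr).
\]
```

With ρ = 0 the variance of the average is σ²/n, so every added independent guess shrinks the error. With ρ = 1 it stays at σ² no matter how large n gets: the average of n copies is exactly as noisy as one guess.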
Because the instant you introduce a single shared anchor (one number, one theory, one preferred option that everyone sees before forming their own view) the errors stop being uncorrelated. They all tilt the same way, toward the anchor. The overshoots and undershoots no longer cancel; they pile up on one side. And the moment that happens, you no longer have 787 guesses, or five estimates. You have one guess, echoed back to you with the volume turned up. The aggregation machinery keeps running, but it's now averaging copies. This is why independence is the keystone of the four conditions: diversity, decentralization, and aggregation all do nothing if the judgments being aggregated were quietly correlated before they were collected.
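A small simulation makes the collapse visible. This is a toy model, not a reconstruction of any study: it assumes each person holds a private guess with Gaussian noise around the truth, that the first person's guess becomes the visible anchor, and that everyone else reports a blend of the anchor and their private guess. The function name `crowd_error` and the parameters `pull` and `noise` are made up for this sketch.

```python
import random
import statistics


def crowd_error(n_people, pull, trials=2000, truth=1198.0, noise=100.0, seed=0):
    """Mean absolute error of the crowd average over many simulated meetings.

    pull = 0.0: everyone reports their own private guess (independent crowd).
    pull = 1.0: everyone after the first simply repeats the first guess (pure echo).
    """
    rng = random.Random(seed)
    errors = []
    for _ in range(trials):
        # Each person's private judgment: the truth plus idiosyncratic noise.
        private = [rng.gauss(truth, noise) for _ in range(n_people)]
        anchor = private[0]  # the first opinion anyone sees
        # Everyone else drifts toward the anchor by the assumed `pull` weight.
        reported = [anchor] + [pull * anchor + (1 - pull) * g for g in private[1:]]
        errors.append(abs(statistics.fmean(reported) - truth))
    return statistics.fmean(errors)


independent = crowd_error(n_people=5, pull=0.0)
anchored = crowd_error(n_people=5, pull=0.7)
echo = crowd_error(n_people=5, pull=1.0)
print(f"independent={independent:.1f}  anchored={anchored:.1f}  echo={echo:.1f}")
```

Under these assumptions the independent crowd has the smallest error, the anchored crowd is noticeably worse, and the pure echo is no better than a single guess, even though all three runs average the same five people.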
What Asch really measured was the price of independence, and it turns out to be heartbreakingly cheap. Social scientists call the failure mode an information cascade: a situation where one person's visible judgment sets off a chain in which everyone after defers to what came before. Cascades arrive by two routes, and engineering rooms get hit by both at once.
The first route is the power differential, sometimes bluntly called the HiPPO problem, the Highest Paid Person's Opinion. When authority, seniority, or simply the loudest and most confident voice speaks first, everyone downstream weights that opinion above their own. The second route is anchoring: the first concrete number, theory, or option that lands becomes the reference point the whole discussion orbits, regardless of whether it's any good. Now notice what the staff engineer's casual “two-day task” actually is. It is a power-differential cascade and an anchor, fired simultaneously. The seniority tells you to defer; the specific number gives you something to defer to. It would be hard to design a more efficient independence-destroyer if you tried.
And engineering destroys independence everywhere, not just in estimation. The pattern is the same cascade wearing different clothes:
Each of these takes a genuinely independent crowd (a group of people who, polled separately, would have surfaced different risks, different numbers, different failure modes) and converts it into a correlated one. Each forfeits the diversity that made the group smart in the first place, and trades it for the comfortable feeling of fast agreement.
Here's where a naive reading goes wrong, so let's not take it. The fix is not to lock everyone in separate rooms and ban discussion forever. Pure, permanent independence throws away the entire point of having a team: the real information-sharing, the “wait, did anyone account for the data migration?” that only happens when people talk. And the research is clear that independence can be relaxed productively: structured discussion after people have formed their own positions improves both the group's estimate and the individuals' own revised estimates. Conversation isn't the enemy.
The enemy is the order. The failure mode is not “people discussed.” It is “the first opinion was seen before the others were formed.” The discipline that captures the benefits of both independence and discussion is a sequence, and it has a name: form, then share; never share, then form. Have each person commit a judgment privately, in writing, before any opinion is visible. Then reveal them, ideally with the reasoning attached and the names stripped off. Then discuss. Then let people privately revise. This is, more or less, the Delphi method, developed at the RAND Corporation in the 1950s and 1960s precisely to get expert groups to forecast without the loudest expert capturing the room, and it reliably outperforms both isolated individuals and an open free-for-all. The independence isn't sacrificed; it's front-loaded. You protect the moment the judgments are formed, and only then do you let them touch.
Planning poker, which feels like a quirky agile ritual, is exactly this insight compressed into a card game. The reason everyone reveals their estimate simultaneously (flipping the cards at once rather than going around the table) is not whimsy. It is the single mechanism that keeps the data points independent. As Rework's guidance on the practice puts it, if estimates came out one at a time, each person would be influenced by the numbers already visible; simultaneous reveal is what gives the group “genuinely independent data points.” The same source recommends a small piece of choreography that is pure Asch-prevention: ask the person holding the outlier estimate to speak first, rather than letting the tech lead anchor the room early. Get the dissent on the table before the anchor can land.
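The two mechanics described here, simultaneous reveal and outlier-first discussion, fit in a few lines. What follows is a minimal sketch, not any planning-poker tool's actual API: the class `EstimationRound` and its methods are hypothetical, and "outlier" is assumed to mean "farthest from the median estimate."

```python
from dataclasses import dataclass, field
from statistics import median


@dataclass
class EstimationRound:
    """Hypothetical helper: holds private estimates and reveals them only when all are in."""
    participants: set[str]
    _sealed: dict[str, float] = field(default_factory=dict, init=False)

    def commit(self, name: str, estimate: float) -> None:
        # Each person commits once, privately; nothing is shown to anyone yet.
        if name not in self.participants:
            raise ValueError(f"{name} is not in this round")
        if name in self._sealed:
            raise ValueError(f"{name} has already committed")
        self._sealed[name] = estimate

    def reveal(self) -> dict[str, float]:
        # Simultaneous reveal: refuse to show any number until every number exists.
        missing = self.participants - set(self._sealed)
        if missing:
            raise RuntimeError(f"waiting on: {sorted(missing)}")
        return dict(self._sealed)

    def speaking_order(self) -> list[str]:
        # Outliers first: farthest from the median speaks first (ties broken by name).
        estimates = self.reveal()
        mid = median(estimates.values())
        return sorted(estimates, key=lambda n: (-abs(estimates[n] - mid), n))


round_ = EstimationRound({"staff", "mid", "junior"})
round_.commit("staff", 2)
round_.commit("mid", 3)
round_.commit("junior", 8)
order = round_.speaking_order()
print(order)
```

In this example the junior's estimate of 8 sits farthest from the median of 3, so the junior explains first and the dissent is on the table before anyone else has spoken.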
The first diagnostic is about who you think the problem is. The instinct, when a team keeps folding to the senior voice, is to fix the people: hire braver engineers, tell the juniors to speak up, have the lead say warmly, “please, disagree with me freely.” None of that works, and Asch tells you why. His subjects weren't fooled and they weren't cowards in any meaningful sense: they knew the answer and conformed anyway, because the pull was social and structural, not a failure of perception or nerve. A senior engineer saying “feel free to disagree with me” does not undo the cascade, because the cascade isn't about permission; it's about sequence. The same senior engineer simply speaking last undoes it completely. The fixes are structural (order, anonymity, simultaneity), not motivational. The one reliable escape hatch Asch himself found points the same way: conformity collapsed dramatically when even a single other person broke from the group and gave the right answer. A lone ally is enough to free the dissenter. That is the entire empirical case for assigning a devil's advocate: for making dissent a designated role someone is obligated to play, rather than an act of individual courage you hope someone finds.
The second diagnostic is about how you read agreement, and it's the one that should make you a little uncomfortable. A room that snaps quickly to consensus feels like a strong, aligned team. But you cannot tell a wise crowd from a merely correlated one by looking at how much they agree. If the agreement formed after the first opinion was visible, it isn't corroboration, it's co-signing. It's the Asch result wearing a tie. So stop asking “did we agree?” as if agreement were the evidence. Ask the only question that actually distinguishes a team estimate from an echo: “were these opinions formed independently before they were shared?” Unanimity reached after the anchor dropped is not your success signal. It's your warning light.
The remedies are all structural, all cheap, and all things you can install this week without a single conversation about culture or courage.
Form before you share. Make every participant commit a number, a position, or a hypothesis privately (in a doc, in a chat they all post at once, on a card) before any opinion is visible. This single sequencing rule is most of the fix.
Reveal simultaneously. Use planning poker for estimates. For code review, where you can, let reviewers form their take before the senior comment is visible, or simply establish that the senior comments last. For decisions, use written pre-reads opened together, not a live opinion that anchors the call.
Sequence the room on purpose. Outliers first, so the dissent exists before anyone has to be brave; the highest-authority and loudest voices last, every time, as policy.
Then discuss, structured, not free-for-all. Once the independent positions are in, share the reasoning (anonymized if you can manage it), argue it out, and let people revise privately before you aggregate. Independence first, then learning, beats both silos and a debate.
Assign the devil's advocate. Make someone responsible for the strongest counter-case, so disagreement is a job, not a personality trait.
And diagnose by sequence, not by agreement. Before you trust a “team consensus,” ask whether the opinions were formed independently before they were shared. If they weren't, you don't have consensus, you have confidence you haven't earned, and you should spend it carefully.
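If your tooling records when each position was written down, the sequence question can be checked rather than guessed at. The sketch below is hypothetical throughout: the `Opinion` record, the `independence_ratio` function, and the idea that you have a `committed_at` timestamp per person and a single `first_visible_at` moment are all assumptions about what your process captures.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Opinion:
    author: str
    committed_at: datetime  # when the author recorded a position


def independence_ratio(opinions: list[Opinion], first_visible_at: datetime) -> float:
    """Share of opinions committed no later than the moment any opinion became visible."""
    if not opinions:
        raise ValueError("no opinions to check")
    independent = sum(1 for o in opinions if o.committed_at <= first_visible_at)
    return independent / len(opinions)


# Meeting A: the staff engineer speaks at 10:00, and everyone else writes a number afterward.
spoken_first = [
    Opinion("staff", datetime(2024, 1, 8, 10, 0)),
    Opinion("mid", datetime(2024, 1, 8, 10, 5)),
    Opinion("junior", datetime(2024, 1, 8, 10, 7)),
]
echo_ratio = independence_ratio(spoken_first, first_visible_at=datetime(2024, 1, 8, 10, 0))

# Meeting B: everyone commits privately, then all cards flip at 10:05.
sealed_first = [
    Opinion("staff", datetime(2024, 1, 8, 10, 2)),
    Opinion("mid", datetime(2024, 1, 8, 10, 3)),
    Opinion("junior", datetime(2024, 1, 8, 10, 4)),
]
crowd_ratio = independence_ratio(sealed_first, first_visible_at=datetime(2024, 1, 8, 10, 5))
```

Both meetings might end in the same unanimous number. Only the ratio tells you that one of them was three judgments and the other was one judgment with two co-signers.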
Because here is the whole thing in one line. A team estimate is the average of independent judgments. The moment the first one becomes visible before the rest are formed, you don't have an average anymore, you have an echo with a quorum. If your process lets the first opinion be seen before the others are formed, you don't have a team's judgment at all. You have one person's judgment, and a room full of people who agreed to put their names on it.
A crowd's wisdom needs uncorrelated signals. So does an agent's reputation.
The whole essay turns on one thing: a pooled judgment is only worth more than its best member when the inputs were formed independently. The same trap waits for agent reputation. If a score is built from ratings that cascaded, everyone deferring to the first loud review, you don't have a crowd's verdict; you have one opinion echoed with a quorum. The Agent Rating Protocol builds reputation from verifiable outcomes rather than copy-paste opinion: a portable track record assembled from what an agent actually did across many independent engagements, so the signal is a real crowd of uncorrelated evidence, not an anchor wearing a tie.
Verify an agent's track record
pip install agent-rating-protocol · npm install agent-rating-protocol