Why your redundant replicas all assume someone else will handle it — and the century-old fix.
In 1913, a French agricultural engineer named Maximilien Ringelmann published the results of a simple experiment: he tied a rope to a dynamometer and asked men to pull on it, first alone and then in groups, while he measured the force. He was trying to understand work output in oxen, horses, and men, the practical economics of who could pull a plough. What he found instead is often cited as one of the oldest results in social psychology, and it reads like a distributed-systems incident report written sixty years before distributed systems existed.
One man pulling alone set the baseline: call it 100% of his capacity. Two men on the rope pulled at 93% each. Three pulled at 85% each. By the time eight men hauled on the same rope, each was contributing just 49% of what he produced alone — barely half. The rope does not lie. Add a body, and the force per body drops. Ringelmann had measured something nobody asked for: as a group grows, each member quietly does less, and the total comes in far below the sum of its parts.
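The arithmetic behind those percentages is worth doing once. A minimal sketch, using only the per-person figures quoted above; the names `PER_PERSON` and `group_output` are illustrative, not taken from Ringelmann's paper:

```python
# Per-person output as a fraction of solo output, as quoted in the text above.
PER_PERSON = {1: 1.00, 2: 0.93, 3: 0.85, 8: 0.49}


def group_output(pullers: int, share: float) -> float:
    """Total group output, measured in 'solo pullers'."""
    return pullers * share


for pullers, share in PER_PERSON.items():
    total = group_output(pullers, share)
    missing = pullers - total  # effort that never reached the rope
    print(f"{pullers} on the rope: total {total:.2f}, missing {missing:.2f}")
```

Eight men on the rope deliver 8 × 0.49 = 3.92 men's worth of pull. More than half of the capacity you paid for never arrives.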
If you have ever stood up a five-replica cluster and watched it deliver nothing close to five replicas' worth of reliability, you already know this finding in your bones. You just learned it from a postmortem instead of a rope.
For a long time the Ringelmann effect was assumed to be a coordination problem: eight men on a rope get in each other's way, pull slightly out of sync, waste force fighting the angles. Reasonable. But in 1979, the psychologist Bibb Latané, with Kipling Williams and Stephen Harkins, ran the experiment that separated coordination loss from a second possible explanation, reduced individual effort, and gave the phenomenon its enduring name: social loafing.
Their design was clever. They blindfolded participants, put headphones on them playing noise so they could not hear anyone else, and asked them to shout or clap as loudly as possible. Each person was sometimes told they were performing alone and sometimes told they were part of a group — but they always acted in physical isolation, so there was no one to get out of sync with. Coordination loss was impossible by construction. And still, the moment people believed they were part of a group, their individual output dropped. The bigger the group they imagined, the less each person produced.
That was the whole point. The effort drain was not friction between bodies. It was motivation. People worked less hard because they believed others shared the load — not because the others physically interfered. Latané identified two mechanisms underneath it, and you should hold onto both words, because they are about to describe your infrastructure. The first is diffusion of responsibility: when a job belongs to everyone, it weighs on no one — someone else will pull. The second is reduced evaluability: in a group, no one can tell whether you specifically slacked, so the cost of slacking falls to zero. Anonymous and shared is the exact recipe for "not my problem."
The same two mechanisms produce a more alarming failure, and it has a more famous origin story — one worth telling carefully, because the popular version is largely wrong.
In 1964, a woman named Kitty Genovese was murdered in Queens, and The New York Times reported that 38 respectable, law-abiding citizens watched from their windows and did nothing. That number, and that tidy image of 38 frozen witnesses, has since been substantially debunked — later investigation found far fewer people saw the whole attack, some did call police, and the "38 who did nothing" framing was closer to myth than record. But the story, true or not, sent two psychologists named John Darley and Bibb Latané into the lab, and what they found there is solid in a way the newspaper account never was.
In their 1968 experiment, participants sat alone in rooms, talking with others over an intercom, when one of those others appeared to have a seizure. The variable was how many other people the participant believed were also listening. When a participant thought they were the only one who could hear the emergency, 85% went for help. When they believed four other people were also listening, only 31% did — and those who did help took far longer to move. Same emergency. Same kind of person. Less than half as likely to act, purely because responsibility had been spread across an imagined group.
This is the finding that should be taped to the wall of every team that has ever responded to an outage with the phrase "let's add another one for redundancy." Adding witnesses did not add safety. It added bystanders. Each additional observer made every individual observer less likely to be the one who acted — because surely, with all these others watching, someone else already has it handled.
Hold those two psychology results next to your own systems, and the resemblance stops being cute and starts being uncomfortable.
You run three monitoring systems for triple coverage, and an incident slips through un-paged because each one, in effect, assumed one of the other two had already fired. You keep five replicas for durability, and under a network partition no replica claims ownership of a write because each assumes another will take it. You deploy a pool of retry workers to guarantee delivery, and a failed message sits in the queue unclaimed, because every worker assumes another grabbed it already. You spin up a dozen identical task-runners and a job goes undone, owned by no one, because ownership was never assigned to anyone in particular.
We are trained to diagnose these as coordination failures — and sometimes they are. Network partitions, clock skew, lost heartbeats, the genuine hard problems of getting machines to agree. But Latané's experiment is the uncomfortable reminder that not every redundancy failure is a coordination failure. Some of them are motivation failures wearing a coordination costume. The retry worker that does not retry "because another worker will surely get it" is not confused about the network topology. It is loafing — it has been built into a structure where the work belongs to everyone and therefore to no one. The monitor that stays silent because, statistically, one of its three siblings probably alerted, is not broken. It is a bystander. You assembled a crowd and then acted surprised when no hero stepped out of it.
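One way to keep a retry worker from loafing is to make it claim the message explicitly instead of assuming a sibling has it. The sketch below is a toy, in-process model: `ClaimTable`, `claim`, and `run_workers` are hypothetical names, and the lock-protected dict stands in for whatever atomic compare-and-set your real store offers.

```python
import threading


class ClaimTable:
    """In-memory stand-in for an atomic compare-and-set store (hypothetical)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._owner = {}  # message_id -> worker_id

    def claim(self, message_id, worker_id):
        """Return True only for the first worker to claim this message."""
        with self._lock:
            if message_id in self._owner:
                return False
            self._owner[message_id] = worker_id
            return True

    def owner_of(self, message_id):
        with self._lock:
            return self._owner.get(message_id)


def run_workers(table, message_id, worker_ids):
    """Every worker attempts the claim; return the ones that won it."""
    winners = []

    def attempt(worker_id):
        if table.claim(message_id, worker_id):
            winners.append(worker_id)  # list.append is atomic in CPython

    threads = [threading.Thread(target=attempt, args=(w,)) for w in worker_ids]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return winners
```

No worker gets to assume: each one asks, and exactly one hears "yes, this is yours." A production version would also need the claim to expire if its owner dies, which this sketch leaves out.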
Diffuse responsibility does not only fail by paralysis. It has a twin failure that looks like the opposite but comes from the same root, and there is a beautiful real-world example of it.
The engineer Arpit Bhayani documented a GitHub outage caused by their ZooKeeper cluster. During scheduled maintenance, too many new nodes were added to the cluster too quickly. Each of those new nodes, booting into a cluster it did not yet understand, reached the same conclusion at the same time: there is no leader here. So each triggered a leader election. Because the freshly-added nodes formed a numerical majority, they elected a second leader — and the cluster now had two nodes each convinced it was in charge. The classic split brain. Roughly 10% of writes to the downstream Kafka cluster failed (data loss was avoided only because a dead-letter queue caught the failed writes).
Look at what happened there against the bystander result, because they are mirror images of the same disease. In the seizure experiment, every witness assumed someone else was acting, so nobody did — paralysis. In the ZooKeeper cluster, every new node assumed nobody was acting, so everybody did — collision. One is a room full of people who all stay seated; the other is a room full of people who all leap up and grab the wheel. Both are what you get when responsibility is not pinned to one identifiable party. Ambiguous ownership does not fail in a single, predictable direction. It fails toward whichever is worse in the moment — silence or a brawl.
Here is the genuinely surprising part. Social psychology did not just diagnose this problem; it found the cures, decades ago. And distributed systems engineering, working in total ignorance of 1968, groped its way to the exact same cures and gave them different names. The mapping is almost suspiciously clean.
Psychology's first and strongest intervention against loafing is making individual contribution identifiable. When people know their specific effort is visible and attributable, loafing largely evaporates. The distributed-systems name for this is leader election, and the consensus protocols got it exactly right. Under Raft or a leader-based Paxos variant, at most one node can win leadership for a given term or ballot, and the load-bearing detail is not just that there is one leader; it is that the leader knows it is the leader, and every other node knows it too. "Any node can take the write" is a crowd on a rope. "This node owns the write, and all the others know they do not" is a named worker who cannot hide.
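The rule that makes a leader identifiable can be sketched in a few lines. This is a heavily simplified model of the one-vote-per-term idea used by Raft, not an implementation of any real protocol: there is no networking, no log comparison, and no handling of stale terms, and the names `Node`, `request_vote`, and `try_become_leader` are illustrative.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.voted_in = {}  # term -> name of the candidate this node voted for

    def request_vote(self, term, candidate):
        """Grant at most one vote per term; repeat requests are idempotent."""
        granted_to = self.voted_in.setdefault(term, candidate)
        return granted_to == candidate


def try_become_leader(candidate, term, cluster):
    """A candidate leads a term only with votes from a strict majority."""
    votes = sum(node.request_vote(term, candidate) for node in cluster)
    return votes >= len(cluster) // 2 + 1
```

Because each node votes once per term and two majorities of the same cluster always overlap, two candidates cannot both win the same term. A would-be second leader has to open a new term, which every voter can see.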
The single most effective bystander intervention ever found is even simpler, and every first-aid course now teaches it: do not shout "somebody call for help." Point at one specific person — "you, in the blue jacket, call 911." Naming one individual collapses the diffusion instantly, because responsibility now has exactly one address. The systems translation is explicit single-owner assignment: one designated health-checker per service rather than N interchangeable ones; one consumer that owns this partition rather than every consumer reading everything; one human primary on the pager, named, rather than an alert sent to "the team," which is the on-call equivalent of shouting into a crowd.
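"You, in the blue jacket" translates to a rule that every node can compute and that names exactly one owner. A minimal sketch, assuming a fixed, known list of consumers; `owner_for` and `should_process` are hypothetical helpers, and plain modulo hashing is used for brevity even though it reshuffles ownership whenever membership changes.

```python
import hashlib


def owner_for(partition: str, consumers: list[str]) -> str:
    """Map a partition to exactly one consumer, the same way on every node."""
    if not consumers:
        raise ValueError("no consumers to assign")
    ranked = sorted(consumers)  # same order regardless of input order
    digest = hashlib.sha256(partition.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(ranked)
    return ranked[index]


def should_process(me: str, partition: str, consumers: list[str]) -> bool:
    """A consumer handles a partition only if it is the named owner."""
    return owner_for(partition, consumers) == me
```

`hashlib` is used rather than the built-in `hash()` because Python randomizes string hashes per process, and an owner that differs from node to node is no owner at all.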
Psychology's second intervention is to keep the responsible group small: two or three people help far more reliably than ten. Systems got there as the quorum: you do not need all five nodes to agree to make a decision safe; you need a defined majority, three of the five. Three nodes that must actively agree are more reliable than five that each assume the others have consensus handled. Small, bounded, explicitly responsible beats large and diffuse. It is Latané's group-size finding, rederived as a voting rule.
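The point about active agreement can be made concrete: a quorum counts explicit acknowledgements, and silence is never read as consent. A minimal sketch with hypothetical names (`majority`, `committed`), where `None` models a node that did not reply:

```python
from typing import Mapping, Optional


def majority(cluster_size: int) -> int:
    """Smallest number of nodes that is more than half the cluster."""
    return cluster_size // 2 + 1


def committed(acks: Mapping[str, Optional[bool]]) -> bool:
    """True only if a majority of nodes explicitly acknowledged.

    None means "no reply"; it is never counted as agreement.
    """
    explicit_yes = sum(1 for reply in acks.values() if reply is True)
    return explicit_yes >= majority(len(acks))
```

With five nodes, three explicit acknowledgements commit the write and two do not, no matter how plausible it is that the silent nodes would have agreed.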
And psychology's other half of the loafing mechanism, reduced evaluability, has its own direct fix: reduce anonymity, make each contribution legible. In systems this is per-node metrics and distributed tracing. When an incident is attributed to "the cluster," no one loafed and no one is accountable, which is precisely the condition that lets it happen again. When the trace says node-7 dropped the write, node-7 has an owner, and the owner has a Monday morning. Tracing is "reduce anonymity" implemented in software.
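In code, reducing anonymity mostly means refusing to record an event without a node attached to it. A small sketch, assuming a hypothetical `WriteEvent` record and a `failures_by_node` helper; real tracing systems carry far more context than this.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class WriteEvent:
    trace_id: str
    node: str  # the specific node that handled the write
    ok: bool


def failures_by_node(events):
    """Attribute each failed write to the node that dropped it."""
    return Counter(event.node for event in events if not event.ok)


events = [
    WriteEvent("t1", "node-3", True),
    WriteEvent("t2", "node-7", False),
    WriteEvent("t3", "node-7", False),
    WriteEvent("t4", "node-1", True),
]
print(failures_by_node(events).most_common())
```

The report no longer says "the cluster dropped two writes." It says which node did, and that answer has an owner.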
There is a fair challenge to all of this, and it comes from the psychology literature itself. In 2019, a team led by Richard Philpot analyzed real CCTV footage of public violent conflicts across three countries and found that at least one bystander intervened in over 90% of cases. The lab-built image of frozen, apathetic witnesses dramatically overstates how people behave in actual emergencies on real streets.
That sounds like it should weaken the whole argument. It does the opposite. Look at which conditions produce inaction and which produce help. The bystander effect bites hardest exactly where Darley and Latané manufactured it: anonymous strangers, no social bonds, ambiguous and unassigned responsibility, no one identifiable. Real streets are full of the opposite — people with social ties, mutual recognition, a felt sense of "this is mine to handle." The 90% intervention rate is not evidence that diffusion of responsibility is fake. It is evidence that the cure works in the wild: where responsibility is legible and felt, people act. Which is a precise description of the two states your architecture can be in. An under-designed system is the lab: anonymous, interchangeable nodes with no assigned ownership — and it freezes. A well-designed one has named leaders, owned partitions, traced requests — identifiable responsibility — and it acts. The knob is the same knob.
So here is the reframe to carry back to your next design review, and it is blunt on purpose: adding redundancy without assigning identifiable responsibility does not add safety. It adds bystanders. Three monitors that each assume the others paged are not three times as safe as one — they may be less safe than one, because the single monitor at least knew the job was unambiguously its own.
The practical test takes one sentence per redundant component. For each one, stop asking "how many of these do I have" and start asking "which exact one is responsible right now, and does it know that it is?" If the honest answer is "all of them" or "whichever one notices first," you have built a crowd on a rope, and Ringelmann already measured what that gets you. If the answer is a single, named, self-aware owner — this leader, this partition, this on-call human — you have converted a diffuse group into an identifiable one, which is the only move that has ever reliably defeated this failure in either people or machines.
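That one-sentence test can be run as a crude audit. The sketch below assumes you can ask each redundant component which instances currently believe they are responsible; how you collect that is up to your system, and `audit` and its three verdicts are hypothetical names chosen for illustration.

```python
def audit(owners_by_component):
    """Return a verdict per component: 'owned', 'unowned' or 'contested'.

    owners_by_component maps a component name to the instances that
    currently claim responsibility for it.
    """
    verdicts = {}
    for component, owners in owners_by_component.items():
        distinct = set(owners)
        if len(distinct) == 1:
            verdicts[component] = "owned"      # one named owner
        elif not distinct:
            verdicts[component] = "unowned"    # everyone is a bystander
        else:
            verdicts[component] = "contested"  # everyone grabbed the wheel
    return verdicts


print(audit({
    "pager": ["alice"],
    "retry-queue": [],
    "zk-leader": ["zk-2", "zk-5"],
}))
```

"Unowned" is the seizure experiment and "contested" is the split brain. Only "owned" passes.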
Ringelmann reported it from a rope in 1913. Darley and Latané demonstrated the mechanism in the lab in 1968. You redeploy it every time you scale a service to N replicas and call it resilient without ever saying which one is in charge. The science is more than a century old, and the fix has not changed: point at exactly one node, and say you.
In an agent fleet, diffusion of responsibility is an attribution problem.
Spin up a dozen interchangeable agents and a task goes undone, owned by no one — the crowd on the rope, again. The cure the essay keeps circling is identifiability: "when the trace says node-7 dropped the write, node-7 has an owner." Chain-of-Consciousness is that, for agents: every action carries a named, signed, tamper-evident owner, so "someone else will handle it" becomes "agent-7 owned this, and we can prove it" — the legible, attributable responsibility that turns a crowd into a worker who cannot hide.
pip install chain-of-consciousness · npm install chain-of-consciousness