
Koch's Postulates: Correlation Isn't Cause, and Here's the 4-Step Proof That This Dependency Is Actually the Culprit

In 1882 Robert Koch didn't find a germ near a disease. He manufactured the disease on demand out of the isolated germ. That distance, between a suspect and a conviction, is one most postmortems never travel. The proof is in do(), not the footage.

Published June 2026 · 12 min read

On the evening of March 24, 1882, Robert Koch stood in front of the Berlin Physiological Society and did something stranger than discovering a germ. Plenty of people had seen bacteria sitting in diseased tissue before; microscopes had been spotting little rods and dots near sick bodies for two centuries. Finding a microbe at the scene of a disease was not news. What Koch did was harder, and far more convincing.

He had taken the slender rod-shaped bacterium out of tubercular tissue and grown it by itself, on a slab of heat-solidified blood serum, alive, multiplying, with no host anywhere near it. Then he injected that pure culture into healthy guinea pigs, and the guinea pigs came down with tuberculosis. Then he opened them up and found the very same bacterium waiting inside.

He hadn't found a germ near a disease. He had manufactured the disease, on demand, out of the isolated germ. That is the entire distance between a suspect and a conviction, and it is a distance most engineering postmortems never travel.

Four rules for proving cause

Within a couple of years, Koch and his collaborator Friedrich Loeffler had hardened that demonstration into a checklist. We call them Koch's postulates today, though they're sometimes called the Henle–Koch postulates, because Koch's old teacher Jakob Henle had sketched the logic back in 1840 and simply lacked the tools to prove a word of it. (Koch refined the wording again in an 1890 address in Berlin.) To claim that a specific microbe causes a specific disease, the postulates demand four things:

  1. Present in every case. The microbe is found in every diseased host and absent from healthy ones.
  2. Isolated in pure culture. You can extract it and grow it alone, separated from the host.
  3. Reproduces the disease. The cultured microbe, introduced into a healthy host, makes the disease appear.
  4. Re-isolated. You recover the same microbe from that newly sickened host.

These rules are about 140 years old. They are still a stricter standard for root cause than the one your team used in last week's incident review, and the reason is hiding in which postulate does the real work.

Read them again as a debugging protocol. Present in every case: the suspect appears in every incident and never in a healthy run. Isolated in pure culture: you can reproduce the failure in a clean environment with only the suspect varying. Reproduces the disease: you introduce the suspect into a healthy system and it fails. Re-isolated: the failure you reproduced is the same failure, by the same mechanism, not a look-alike.

Now compare that to how root cause usually gets assigned. Something broke. Someone scrolled the logs, found the dependency that was timing out, noticed it was "present and broken during the incident," and wrote it up as the root cause. A fix shipped. Everyone went to lunch.

That is exactly the move Koch's whole career rebuked. The timing-out dependency is the microbe in the patient. It was there. Lots of things were there. Being present at the scene is postulate 1 on a generous day, and postulate 1, it turns out, is the weak one.

Necessary, sufficient, and the rung you skipped

The four postulates are really testing two different logical claims, and it pays to keep them straight.

Postulate 1 is a test of necessity. If the disease ever shows up without the microbe, the microbe isn't necessary; it's a fellow traveler. In debugging: if the system ever failed without the suspect, the suspect isn't the cause. Koch learned the limits of his first postulate the hard way: when he went after cholera in 1883 and 1884, he found healthy people walking around shedding the cholera bacillus (asymptomatic carriers) and was honest enough to quietly drop the universal version of the postulate. The pathogen was present in perfectly healthy hosts. Software has the same creatures: latent bugs that live in a healthy system for years, doing nothing, until a load pattern or a feature flag finally wakes them. This is why "that service hasn't changed in two years" is not an alibi. "Present in a healthy host" breaks the "absent from healthy ones" half of postulate 1 in a Petri dish and in production alike.
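The necessity check is mechanical enough to sketch in a few lines. The record shape below (a `failed` flag plus a set of `present` components) and every component name in it are hypothetical, invented for illustration; real incident data will look different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One observed run or incident. This shape is a hypothetical example."""
    name: str
    failed: bool
    present: frozenset  # components that were active or degraded in this run


def failures_without(runs, suspect):
    """Failed runs where the suspect was absent: each one refutes necessity."""
    return [r.name for r in runs if r.failed and suspect not in r.present]


def healthy_carriers(runs, suspect):
    """Healthy runs where the suspect was present: the asymptomatic carriers."""
    return [r.name for r in runs if not r.failed and suspect in r.present]


# Made-up records, for illustration only.
runs = [
    Run("incident-1", True, frozenset({"payments-api", "cache"})),
    Run("incident-2", True, frozenset({"cache"})),
    Run("nightly-ok", False, frozenset({"payments-api"})),
]

print(failures_without(runs, "payments-api"))  # ['incident-2']
print(healthy_carriers(runs, "payments-api"))  # ['nightly-ok']
```

A single entry in the first list is enough to demote the suspect to a correlate. Entries in the second list don't clear it, but they do say that presence alone was not enough to cause the failure.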

Postulate 3 is a test of sufficiency, and it is the one that convicts. The suspect, introduced on its own into a healthy host, produces the failure. Not "was nearby when the failure happened": produces it, now, because you made it.

The cleanest way to see why these are different rungs of knowledge comes from the computer scientist Judea Pearl, who spent decades formalizing exactly this. In The Book of Why (2018), Pearl describes a "ladder of causation" with three rungs: seeing (association, what's correlated with what), doing (intervention, what happens when I act), and imagining (counterfactuals). The jump from rung one to rung two is the jump from P(failure | suspect is present) to P(failure | do(introduce the suspect)), from watching to intervening. They are not the same quantity, and no amount of careful watching ever turns into doing.
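A toy simulation makes the gap between the two quantities concrete. Everything in it is an assumption chosen for illustration: a hidden `overload` condition drives both the suspect's timeouts and the failures, the suspect itself has no effect on failure, and the probabilities are made up.

```python
import random


def simulate(n, rng, force_suspect=None):
    """Generate n runs as (suspect_present, failed) pairs.

    Toy model (all numbers are invented): a hidden overload causes both the
    suspect's timeouts and the failures. The suspect never causes a failure.
    Passing force_suspect=True plays the role of do(introduce the suspect).
    """
    runs = []
    for _ in range(n):
        overload = rng.random() < 0.2
        if force_suspect is None:
            # Observation: the suspect mostly shows up when overload does.
            suspect = rng.random() < (0.9 if overload else 0.05)
        else:
            # Intervention: we set the suspect ourselves, ignoring overload.
            suspect = force_suspect
        failed = rng.random() < (0.8 if overload else 0.01)
        runs.append((suspect, failed))
    return runs


def failure_rate_given_suspect(runs):
    """Fraction of runs that failed, among runs where the suspect was present."""
    outcomes = [failed for suspect, failed in runs if suspect]
    return sum(outcomes) / len(outcomes)


rng = random.Random(0)
observed = failure_rate_given_suspect(simulate(100_000, rng))
intervened = failure_rate_given_suspect(simulate(100_000, rng, force_suspect=True))

print(f"P(failure | suspect present)       ~ {observed:.2f}")
print(f"P(failure | do(introduce suspect)) ~ {intervened:.2f}")
```

In this model the observed rate comes out far higher than the intervened one, because seeing the suspect is really seeing the overload. Forcing the suspect on leaves the failure rate at its baseline, which is what an innocent bystander looks like on rung two.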

Postulate 1 lives on rung one. Postulate 3 lives on rung two. Correlation ("the dependency was present and broken during the incident") can climb to the top of rung one and no higher. It will never reach the thing reintroduction reaches, for the same reason a security camera that filmed someone near a burglary hasn't proven they did it. The conviction isn't in the footage. It's in do().

You already own the instrument

Here's the good news: every engineer already has a tool that reaches rung two, and most of us use it without noticing it's Koch's third postulate in disguise.

git bisect walks the commit history with a binary search until it hands you the single commit that introduced a bug. But finding the commit is only postulate 1: present at the scene. The conviction comes from the next two moves, which are so routine we forget they're profound: you revert that commit and the bug vanishes, and then (this is the part that matters) you re-apply it and the bug comes back. Gone, then back, on your command. That is reintroduction. That is "I can make it fail by toggling this." That is the guinea pig.

The "I can make it fail by toggling this" test is the gold standard precisely because co-occurring evidence can never reach it. A log line, a flame graph, a suspicious diff, a dependency that was red on the dashboard, these are all rung-one evidence, and they top out at suspicion. The toggle is rung two. When you can flip the suspect on and the failure appears, flip it off and the failure leaves, flip it on again and it returns, you have done what Koch did to that bacillus, and you are allowed to use the word cause.
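The toggle test can be written down as a small harness. This is a minimal sketch, not a prescribed tool: `set_suspect` and `system_fails` are hypothetical callables you would supply (reverting and re-applying a commit, flipping a flag, running a failing test), and `ToySystem` with its flag names exists only to exercise the harness.

```python
def convicts(set_suspect, system_fails, rounds=2):
    """Return True only if the failure tracks the toggle in every round.

    set_suspect(on) introduces or removes the suspect; system_fails() runs
    whatever check reproduces the incident. Both are supplied by the caller.
    """
    for _ in range(rounds):
        set_suspect(True)
        if not system_fails():
            return False  # suspect present, no failure: not sufficient
        set_suspect(False)
        if system_fails():
            return False  # suspect removed, still failing: something else
    return True


class ToySystem:
    """Stand-in system that fails whenever its true culprit flag is on."""

    def __init__(self, culprit):
        # Hypothetical flags: the pool leak is already live in this system.
        self.flags = {"new_retry_policy": False, "pool_leak": True}
        self.culprit = culprit

    def set_flag(self, name, on):
        self.flags[name] = on

    def fails(self):
        return self.flags[self.culprit]


# Case 1: the retry policy was "present and broken", but the leak is the cause.
leaky = ToySystem(culprit="pool_leak")
bystander = convicts(lambda on: leaky.set_flag("new_retry_policy", on), leaky.fails)

# Case 2: the retry policy really is the cause.
guilty = ToySystem(culprit="new_retry_policy")
culprit = convicts(lambda on: guilty.set_flag("new_retry_policy", on), guilty.fails)

print(bystander, culprit)  # False True
```

Running more than one round is deliberate: a single on/off pass can line up with an unrelated flap, while gone, back, gone, back is much harder to get by accident.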

And don't skip postulate 4. Re-isolation guards against a trap that's sneaky in software: reproducing a failure that looks like the failure. You toggled something, an error appeared, the error message matched, case closed? Not yet. Confirm it's the same failure by the same mechanism, not a different bug wearing the same stack trace. A green "I reproduced it" can be a coincidence dressed as a confirmation. Koch checked that the microbe he pulled out of the guinea pig was the one he put in. Check that the bug you reproduced is the one you're hunting.
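One crude way to check for a look-alike is to compare more than the error message. The sketch below fingerprints an exception by its type and the chain of function names in its traceback; the two failing functions are hypothetical, and function names are only a rough proxy for "same mechanism", so treat a match as necessary rather than conclusive.

```python
import traceback


def fingerprint(exc):
    """Exception type plus the function names on its traceback, outermost first."""
    frames = traceback.extract_tb(exc.__traceback__)
    return (type(exc).__name__, tuple(frame.name for frame in frames))


def capture(fn):
    """Call fn and return the exception it raised, or None if it succeeded."""
    try:
        fn()
    except Exception as exc:
        return exc
    return None


# Two hypothetical failures that surface with an identical message.
def pool_exhausted():
    raise TimeoutError("request timed out")


def slow_dns():
    raise TimeoutError("request timed out")


original = capture(pool_exhausted)
reproduced = capture(slow_dns)

same_message = str(original) == str(reproduced)
same_fingerprint = fingerprint(original) == fingerprint(reproduced)
print(same_message, same_fingerprint)  # True False
```

The messages match and the paths don't, which is exactly the "different bug wearing the same stack trace" case in miniature: a reproduction that would have passed a glance at the logs.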

The reason this isn't pedantry

So what does it cost to convict on correlation? The same incident, next Tuesday.

This is the part teams underrate. A fix shipped against an unconvicted suspect isn't a fix; it's a coincidence you're hoping holds. You removed the thing that was merely present, the actual cause is still live in the system, and the failure is still loaded in the chamber. John Snow's nineteenth-century lesson was that if you blame the wrong thing, the handle stays on the contaminated pump and people keep getting sick. The engineering version is quieter but identical: you "fixed" the memory pressure by adding RAM because a restart made the symptom go away, and three weeks later the connection-pool leak that was the real cause takes the service down again, at 3 a.m., for the same forty minutes.

The recurrence is the tell. When an incident comes back wearing the same face after you "fixed" it, the overwhelmingly likely explanation is that you convicted a correlate. You treated a suspect as a culprit, removed the wrong thing, and left the pathogen in the water.

When you can't reach a conviction

Now the honest complication, because rigor that only works on easy cases isn't rigor.

Some pathogens won't grow in a dish. A virus can't be cultured alone at all; it needs a living host cell to replicate, so postulate 2 was simply impossible for an entire class of disease, and the early bacteriologists knew it. Debugging has the same unculturable cases: race conditions, heisenbugs that evaporate the moment you attach a debugger, the failure that only happens in production under real traffic and refuses to show up in staging. You cannot isolate it. You cannot toggle it on demand. Postulate 3 is off the table.

The unculturable agent isn't the only way the postulates break, and each of the other break cases tells you where to look next.

Some diseases are polymicrobial: no single organism reproduces the illness because the cause lives in the consortium, the way a cascade can live in the interaction between service A's reasonable retry policy and service B's reasonable rate limit while every component tests innocent alone. Run the reproduction on any single suspect and it comes back clear, because individually it is; stop hunting units and look at the interaction.

And then there is the prion, the misfolded protein behind mad-cow disease that carries no nucleic acid at all and breaks every postulate because the framework assumed the wrong kind of culprit. Its engineering twin is the most humbling incident there is: you are deep in the application code and the cause was never code; it was a clock skew, a data corruption, a dependency three layers down. You will never clear a postulate about your code, because your code is not the pathogen.

When no single component is guilty, go systems-level; when nothing in the code satisfies any postulate, you are in the wrong category, so go look at config, data, and infrastructure.

The wrong response is to lower the bar and convict the nearest suspect anyway because the postmortem is due. The right response is the one epidemiology worked out a century after Koch: when you can't run the experiment, switch to a weaker but still principled standard. There are two rungs down, and they're both usable.

The first is molecular Koch's postulates, which the microbiologist Stanley Falkow proposed in 1988. His insight: if you can't grow the whole organism, go after the responsible gene instead. Inactivate the specific gene and show the pathogenicity disappears; restore the gene and show it comes back. You've convicted the mechanism without ever culturing the organism. The debugging translation is immediate and powerful: you may not be able to reproduce the entire failure cascade, but you can often flip the one feature flag, drain the one node, or revert the one config line and watch the error rate move with it. Targeted ablation gets you a conviction on the gene even when the organism stays in the wild.

The second, for when you can't even do that, is Bradford Hill's criteria, from a 1965 address by the British epidemiologist Sir Austin Bradford Hill. Hill built his nine "viewpoints" for the exact situation where Koch's third postulate is impossible: he was arguing that smoking causes lung cancer, and you obviously cannot assign a few thousand healthy people to smoke for thirty years and watch. When you can't intervene, what's left? Of Hill's nine, two do most of the work for engineers. The first is temporality: the suspect must precede the failure. Hill considered it the one non-negotiable, and it's also the one teams botch most, blaming a deploy that actually shipped after the latency spike began. Check the clock before you check anything else. The second is biological gradient, or dose-response: more of the suspect should mean more of the failure. Route 5% of traffic to the suspect node, then 20%, and watch whether the error rate climbs with the dose. A clean dose-response curve is about the strongest causal evidence you can get without reproducing the failure on demand, and it's often sitting right there in your traffic-shaping tools. Strength, consistency, and plausibility fill in around those two.
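Both checks reduce to a few lines once the measurements exist. The sketch below assumes you can get a timestamp for the suspect and for the failure onset, and request and error counts at each traffic share; the timestamps and counts shown are made up for illustration.

```python
from datetime import datetime


def precedes(suspect_time, failure_onset):
    """Temporality: the suspect must come strictly before the failure."""
    return suspect_time < failure_onset


def error_rates_by_dose(observations):
    """Turn (traffic_share, requests, errors) rows into (share, error_rate)."""
    return [(share, errors / requests)
            for share, requests, errors in sorted(observations)]


def climbs_with_dose(observations):
    """Dose-response: the error rate rises at every step up in traffic share.

    Strict monotonicity is a simplification; real data is noisy and would
    need a tolerance or a proper trend test.
    """
    rates = [rate for _, rate in error_rates_by_dose(observations)]
    return len(rates) >= 2 and all(a < b for a, b in zip(rates, rates[1:]))


# Hypothetical timeline: the deploy shipped after the spike had already begun.
deploy = datetime(2026, 6, 2, 14, 32)
spike_start = datetime(2026, 6, 2, 14, 5)
deploy_came_first = precedes(deploy, spike_start)

# Hypothetical counts at 5%, 20%, and 50% of traffic on the suspect node.
graded = [(0.05, 10_000, 40), (0.20, 10_000, 170), (0.50, 10_000, 410)]
flat = [(0.05, 10_000, 40), (0.20, 10_000, 41), (0.50, 10_000, 39)]

print(deploy_came_first)        # False: this deploy cannot be the cause
print(climbs_with_dose(graded)) # True
print(climbs_with_dose(flat))   # False
```

Running the temporality check first is cheap and often decisive: a suspect that arrived after the failure began needs no further experiments.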

This is a ladder, and the discipline is to climb down it deliberately rather than fall off it. Conviction by reproduction (Koch's postulate 3) is the top. Targeted ablation (Falkow) is the next rung. Population-level inference (Hill, with temporality and dose-response leading) is the rung below that. What you don't get to do is stand on the ground floor ("it was present and broken during the incident") and call it the top.

Rung | Standard | The engineering move
Top: conviction | Koch postulate 3 (reproduce on demand) | Revert and re-apply: make it fail by toggling
Down one | Falkow: molecular / targeted ablation | Flip the one flag, drain the one node, revert the one line
Down two | Bradford Hill: population inference | Temporality (precedes) + dose-response (5% then 20%)
Ground floor: not a cause | Postulate 1 only (present at the scene) | "Present and broken during the incident" = suspected

The one sentence that stops next week's outage

Here's the checklist to keep next to your incident template. Three questions, in order:

  1. Necessity. Did the failure ever happen without this suspect? If yes, it's a correlate. Keep looking.
  2. Sufficiency. Can you make the failure happen on demand by reintroducing the suspect (the toggle, the revert-and-reapply, the dialed-up dose)? If yes, you have a culprit.
  3. Reversibility. Remove it. Does the failure stop, and stay stopped?

If you can clear all three, write "confirmed root cause" and mean it. But if you can't answer the second one, if everything you have is rung-one evidence, the dependency that was simply present and broken, then the most rigorous thing you can put in the document is a single, unglamorous word: suspected. Mark it as a hypothesis, ship a mitigation if you must, and keep the investigation open.
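The three questions and the wording rule can be captured as a small decision function for an incident template. This is one possible encoding, not a standard: the `Verdict` names and the use of `None` for "not tested" are assumptions made for the sketch.

```python
from enum import Enum
from typing import Optional


class Verdict(Enum):
    CORRELATE = "correlate"
    SUSPECTED = "suspected"
    CONFIRMED = "confirmed root cause"


def verdict(failed_without_suspect: bool,
            reproduced_on_demand: Optional[bool],
            stopped_after_removal: Optional[bool]) -> Verdict:
    """Map the three checklist answers to the word that goes in the writeup.

    None means the question was never tested, which counts as not cleared.
    """
    # 1. Necessity: a failure without the suspect demotes it to a correlate.
    if failed_without_suspect:
        return Verdict.CORRELATE
    # 2 and 3. Sufficiency and reversibility must both be shown, not assumed.
    if reproduced_on_demand is True and stopped_after_removal is True:
        return Verdict.CONFIRMED
    # Anything short of that is rung-one evidence: a hypothesis.
    return Verdict.SUSPECTED


print(verdict(False, True, True).value)   # confirmed root cause
print(verdict(False, None, True).value)   # suspected
print(verdict(True, True, True).value)    # correlate
```

Note the second case: a fix that made the symptom go away, with no reproduction behind it, still comes out as "suspected".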

Because the discipline Koch handed to medicine was never the promise that you can always reach proof. Cholera carriers, unculturable viruses, and his own walked-back first postulate are proof that he couldn't, and he said so. The discipline was the honesty to know the difference between the germ you found near the patient and the germ you proved makes people sick, and to never, ever confuse the two on the page.

"Correlated, not confirmed" is an uncomfortable line to write under pressure, with a director asking for a root cause and the clock running. It is also the single sentence most likely to keep you from holding this exact same retro again next Tuesday. Koch convicted his bacillus by making a healthy animal sick on command. Until you've done the equivalent, until you can make it fail by toggling this, you don't have a culprit. You have a suspect you haven't tried hard enough to clear.

"Confirmed root cause" should carry the do() behind it, not just the correlation.

When an agent writes up a cause, the difference between rung one and rung two is whether it actually toggled the suspect or only watched it co-occur, and that difference has to be recorded to be trusted. Chain of Consciousness keeps that record: a tamper-evident trace of what an agent observed, what it intervened on, and what changed, so a "confirmed" verdict comes with the reproduction behind it instead of standing in for it.

pip install chain-of-consciousness · npm install chain-of-consciousness