The guarantee that “holds” because its precondition never triggers is the most dangerous kind. Every guarantee is secretly an implication, and an implication is true for free whenever its “if” never fires.
On the night of January 31, 2017, a GitLab engineer was fighting a struggling database when he ran a cleanup command against the wrong server. He meant to wipe a misbehaving replica. He realized, a second too late, that he was on the live production primary. By the time he hit Ctrl-C, about 300 gigabytes of customer data (projects, issues, merge requests, user accounts, the accumulated work of a great many people's day) had evaporated.
That was the bad part. The legendary part came next. GitLab had not one backup strategy but five: regular database dumps, disk snapshots, replication, the belt-and-suspenders works that any responsible team is supposed to have. When they went to restore, they discovered that not one of the five was actually working. The database dumps had been silently failing for months (a version mismatch made the tool produce near-empty files, and the failure notifications had been going somewhere nobody read). The snapshots weren't enabled on the database servers. The rest were misconfigured or empty. Their own postmortem stated it with a bluntness that has been quoted ever since: “out of 5 backup/replication techniques deployed none are working reliably or set up in the first place.” They were rescued only by luck: a single snapshot one engineer had taken by hand, six hours earlier, for unrelated testing. They restored from that and lost six hours of the world's work, live-streaming the entire grim recovery to a horrified audience.
Here is the thing worth pausing on. GitLab did not lack backups on paper. They had five guarantees. And every one of those guarantees was, in a precise and poisonous sense, true, right up until the moment someone needed it. They were true the way “if pigs fly, then I'm a billionaire” is true. And the gap between that kind of true and the kind you actually want is where this whole essay lives, because your system is full of the first kind, wearing the green checkmarks of the second.
Crack open any logic textbook and you hit a small scandal in the first chapter. The material implication “P implies Q” (written P → Q) is false in exactly one situation: when P is true and Q is false. In every other row of the truth table it's true. Which means that whenever P is false, the whole implication comes out true no matter what Q says. “If pigs fly, then I'm a billionaire” is a true statement. Not because I'm rich, but because pigs don't fly. The antecedent is false, so the implication is satisfied for free.
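The four rows are small enough to enumerate by hand, but a short Python sketch makes the free rows visible. The names `implies`, `table`, and `vacuous_rows` are illustrative choices, not anything from a library.

```python
from itertools import product


def implies(p: bool, q: bool) -> bool:
    """Material implication: false only when p is true and q is false."""
    return (not p) or q


# Enumerate all four rows of the truth table.
table = {(p, q): implies(p, q) for p, q in product([True, False], repeat=2)}

for (p, q), result in table.items():
    print(f"P={p!s:5}  Q={q!s:5}  P->Q={result}")

# Rows where the antecedent is false: the implication holds whatever Q says.
vacuous_rows = [(p, q) for (p, q), holds in table.items() if not p and holds]
```

Two of the three true rows are in `vacuous_rows`. Only the row with P true and Q true says anything about Q.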
Logicians call this vacuous truth, and it reliably trips up everyone the first time, because it feels like getting away with something. It isn't a trick or an error; it's correct, and it has to be that way for logic to work at all. But that correctness is exactly the hazard, because every guarantee you make about a system is secretly an implication. “We handle errors” unfolds to “if an error occurs, then we recover.” “We have a fallback” means “if the primary goes down, the backup takes over.” “We meet the SLA” means “if the covered event happens, we respond in time.” “The retry logic always recovers” means “if a transient failure hits, the retry saves us.”
Now read those as implications and a cold little question appears. In all your green tests (the suite that's passing right now, the dashboard that's all checkmarks), did the antecedent, the “if” part, ever actually fire? Did a real error occur? Did the primary actually go down? Did the covered event actually happen? Because if it didn't, your passing test confirmed the cheap half of the truth table. It established “P was false here” (no error happened in this run), which tells you precisely nothing about whether Q is any good. The handler is green. The handler is also untouched. You have proven, with great ceremony, that pigs did not fly.
If that sounds like a philosopher's worry rather than an engineer's, here is the fact that should change your mind: the people who verify hardware for a living discovered this decades ago, gave it a name, and built detectors for it, because it was quietly wrecking their work.
In 1997, four IBM researchers (Ilan Beer, Shoham Ben-David, Cindy Eisner, and Yoav Rodeh) presented the work that later appeared in journal form as “Efficient Detection of Vacuity in Temporal Model Checking.” Model checking is about as rigorous as verification gets: you write a formal specification and a tool exhaustively proves your design obeys it. And these four had noticed something unsettling. Their model checkers kept returning PASS on specifications that were satisfied vacuously, properties that held only because their preconditions could never actually be met. Their canonical example is this entire essay compressed into one line of formal logic. The specification AG(req → AF grant) reads “every request is eventually granted.” It passes (cleanly, greenly, provably) in any system where a request is never sent. The checker isn't malfunctioning; the property genuinely holds. It just holds for the “if pigs fly” reason: no request, so the implication is free.
And at IBM this was not some rare academic edge case. The motivation for the whole line of work was that a significant fraction of the properties their engineers were proudly checking off (by the authors' own account, typically around 20% during the first verification runs of a new hardware design) were passing vacuously, verifying nothing, while everyone moved on satisfied. So they built vacuity detection, and its essential companion, the interesting witness: a tool that, alongside the PASS, hands you a concrete trace in which the antecedent actually fires, so a human can look and confirm the property means something real. Eighteen years later, a 2015 follow-up sharpened the idea to exactly our case and even named it after the symptom, “temporal antecedent failure”: preconditions that are never enabled. The lesson is worth saying plainly: the people who verify the processor your code is running on treat “did the antecedent ever fire?” as a first-class engineering question, with tooling to enforce it. Most software teams never think to ask. (To be fair to the tool: vacuity detection asks the right question. It doesn't fix your spec for you; it flags that a guarantee was satisfied vacuously and leaves the next move to you. But asking is most of the battle, because almost nobody asks.)
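To make the idea concrete, here is a deliberately small sketch of vacuity detection. It is not a model checker and not the IBM algorithm: it assumes you have a handful of finite recorded traces, each a list of sets of signal names, and it checks “every req is followed, at the same step or later, by a grant.” The `Verdict` type and the function name are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Set


@dataclass
class Verdict:
    holds: bool              # no trace violates the property
    vacuous: bool            # holds only because "req" never occurred anywhere
    witness: Optional[int]   # index of a trace where "req" fired and was granted


def check_request_granted(traces: Sequence[Sequence[Set[str]]]) -> Verdict:
    holds = True
    fired = False
    witness = None
    for i, trace in enumerate(traces):
        trace_fired = False
        trace_ok = True
        for t, step in enumerate(trace):
            if "req" in step:
                trace_fired = True
                # "Eventually granted" on a finite trace: a grant now or later.
                if not any("grant" in later for later in trace[t:]):
                    trace_ok = False
        fired = fired or trace_fired
        holds = holds and trace_ok
        if witness is None and trace_fired and trace_ok:
            witness = i  # an "interesting witness": antecedent fired, consequent held
    return Verdict(holds=holds, vacuous=holds and not fired, witness=witness)


idle_system = [[set(), set()], [{"idle"}]]
busy_system = [[set(), {"req"}, {"grant"}]]
print(check_request_granted(idle_system))
print(check_request_granted(busy_system))
```

Both systems pass. Only the second pass comes with a witness, which is the difference the PASS alone hides.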
This is not a theoretical hazard with a tidy hardware-lab pedigree. It is, by one of the most cited measurements in distributed systems, the dominant way large systems die.
In 2014, a team led by Ding Yuan presented a paper at the USENIX OSDI conference with a title that sounds almost smug until you read the data: “Simple Testing Can Prevent Most Critical Failures.” They took 198 randomly sampled, real-world failures from five heavily used systems (Cassandra, HBase, HDFS, Hadoop MapReduce, and Redis), singled out the catastrophic ones (the kind that take down a whole cluster or lose data), and traced each back to its root cause. The headline finding is the kind of number you have to read twice:
92% of the catastrophic failures were caused by incorrect handling of errors that the software had already explicitly detected.
Sit with that. In almost every catastrophe, the error was caught. The system saw it coming and signaled it. And then the code written to handle it did the wrong thing, because no one had ever watched that code run. The guarantee existed. The guarantee was green. The guarantee was vacuous. And it gets sharper. Roughly a third of the catastrophic failures (35%) came from handlers that were trivially broken: catch blocks that were empty, or that only logged the error and limped on, or that contained a literal TODO/FIXME comment, or that responded to a minor, recoverable error by killing the entire cluster. Then comes the single statistic that is this whole essay wearing a lab coat: 23% of the catastrophic failures would have been prevented by nothing more sophisticated than running the error-handling code once, that is, by statement coverage on the catch blocks. Not formal proof, not clever fuzzing. Just firing the antecedent one time and looking at what happened.
And these were not exotic, irreproducible Heisenbugs that you could forgive a team for missing. 77% of the failures could be reproduced with a single unit test. 74% were fully deterministic, same trigger, same crash, every time. The bugs were sitting, patient and obvious, in exactly the lines that never execute, because the if in front of them had never once been true in anyone's test.
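What “running the error-handling code once” can look like in practice is sketched below, using `unittest.mock` from the Python standard library to force the error. The `Replicator` class and its retry queue are hypothetical, invented for this illustration; they are not taken from any of the systems in the study.

```python
import logging
from unittest import mock

log = logging.getLogger("replicator")


class Replicator:
    """Hypothetical component: 'if a write fails, then we queue it for retry.'"""

    def __init__(self, store):
        self.store = store
        self.pending = []

    def write(self, record) -> bool:
        try:
            self.store.put(record)
            return True
        except IOError as exc:
            # The handler under test: this branch never runs in a happy-path suite.
            log.warning("write failed, queueing for retry: %s", exc)
            self.pending.append(record)
            return False


def test_write_queues_record_when_store_fails():
    store = mock.Mock()
    store.put.side_effect = IOError("disk full")  # force the antecedent true
    replicator = Replicator(store)

    assert replicator.write({"id": 1}) is False   # the consequent: no crash...
    assert replicator.pending == [{"id": 1}]      # ...and the record is not lost
    store.put.assert_called_once_with({"id": 1})


test_write_queues_record_when_store_fails()
```

The test is a dozen lines and deterministic. Its only job is to make the `except` branch execute while someone is watching.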
Once you see the disease this clearly, the cure writes itself. If a guarantee is only really verified when its antecedent fires, and your normal tests never fire it, then you have to fire it on purpose. That sentence is the entire definition of chaos engineering.
When Netflix built Chaos Monkey in 2011 (a service that prowls the production fleet and deliberately kills running instances, at random, in the middle of the business day), outsiders saw recklessness. It was the opposite. It was vacuity detection for a distributed system. Every “we tolerate an instance dying” guarantee had, until that point, been vacuously true: nothing had actually killed an instance under real production load, so the failover had never been observed, only assumed. Chaos Monkey makes the antecedent true on demand and lets you watch whether the consequent actually holds. The method maps onto the truth table step for step: define steady state (what healthy looks like in your real metrics), inject the failure (force P true: kill the box, fail the dependency, inject the latency), and check whether steady state survives (is Q actually good?). If it does, then, and only then, have you established the implication the honest way, with a real witness standing on the spot where the precondition fired.
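The three steps can be sketched in miniature. This is a toy, in-process simulation under stated assumptions, not Chaos Monkey and not Netflix's tooling: `Instance`, `Pool`, `success_rate`, and the 0.99 threshold are all made up for the example.

```python
import random


class Instance:
    """A hypothetical server that can be killed."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.alive = True

    def handle(self, request: int) -> str:
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


class Pool:
    """Tries each instance in order; the guarantee is 'if one dies, another serves.'"""

    def __init__(self, instances) -> None:
        self.instances = list(instances)

    def handle(self, request: int) -> str:
        for instance in self.instances:
            try:
                return instance.handle(request)
            except ConnectionError:
                continue
        raise RuntimeError("no healthy instance")


def success_rate(pool: Pool, n: int = 100) -> float:
    ok = 0
    for request in range(n):
        try:
            pool.handle(request)
            ok += 1
        except RuntimeError:
            pass
    return ok / n


def run_experiment(pool: Pool, rng: random.Random, threshold: float = 0.99) -> dict:
    baseline = success_rate(pool)         # 1. define steady state
    victim = rng.choice(pool.instances)   # 2. inject the failure: force P true
    victim.alive = False
    after = success_rate(pool)            # 3. check whether steady state survives
    return {
        "victim": victim.name,
        "baseline": baseline,
        "after": after,
        "steady_state_held": after >= threshold,
    }


report = run_experiment(Pool([Instance("a"), Instance("b"), Instance("c")]), random.Random(0))
print(report)
```

Run the same experiment against a pool of one instance and `steady_state_held` comes back false, which is the whole point: the report records a run in which the antecedent fired, whichever way the consequent went.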
Netflix went on to build a whole ladder of antecedents, each a bigger “if” fired against a fallback that would otherwise stay green and unproven: Chaos Monkey kills individual instances, Failure Injection Testing forces specific service-to-service calls to fail so you can watch the system degrade, and Chaos Kong simulates the loss of an entire AWS region. And the discipline keeps finding new antecedents to fire. In 2026, teams are chaos-testing AI agents by injecting tool-call failures, timeouts, and malformed responses, because an agent making dozens of tool calls in a chain will meet every one of those conditions in production, and an untested failure path is just a customer-facing outage waiting for its precondition to arrive.
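For the agent case, the same move fits in a few lines. The sketch below assumes a toy agent step that calls one JSON-returning tool; `with_fault`, `lookup_tool`, and `agent_step` are hypothetical names, and no real agent framework is being modeled.

```python
import json
from typing import Callable


def with_fault(tool: Callable[[str], str], fault: str) -> Callable[[str], str]:
    """Wrap a tool so that every call exhibits one injected fault (or none)."""

    def wrapped(arg: str) -> str:
        if fault == "failure":
            raise RuntimeError("injected tool failure")
        if fault == "timeout":
            raise TimeoutError("injected timeout")
        if fault == "malformed":
            return "{not valid json"
        return tool(arg)

    return wrapped


def lookup_tool(city: str) -> str:
    """Stand-in for a real tool call; returns a JSON string."""
    return json.dumps({"city": city, "temp_c": 21})


def agent_step(tool: Callable[[str], str], city: str) -> dict:
    """Toy agent step whose guarantee is 'if the tool misbehaves, report it, don't crash.'"""
    try:
        data = json.loads(tool(city))
        return {"ok": True, "temp_c": data["temp_c"]}
    except (RuntimeError, TimeoutError, json.JSONDecodeError, KeyError) as exc:
        return {"ok": False, "reason": type(exc).__name__}


# Fire each antecedent once and record what the failure path actually did.
results = {
    fault: agent_step(with_fault(lookup_tool, fault), "Oslo")
    for fault in ["none", "failure", "timeout", "malformed"]
}
print(results)
```

Each entry in `results` is a run where one specific precondition was forced true, so the recovery claim rests on an observed branch and not on the absence of trouble.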
One caution, so this stays discipline and not zealotry: chaos engineering fires the antecedents you think to inject. It is a search for the unfired preconditions you can imagine, not a proof that you've fired them all; you can no more trigger every possible failure than you can test every input. Its value is the severity of the search, not a certificate of completeness. But a fallback that has survived one real, injected region failure is in a completely different epistemic class from one that has survived zero, and the gap between zero and one is the whole game.
Two honest boundaries first, because precision is the credibility of this idea. Vacuous truth is not a lie in the logical sense: “if error then recover” really is true when no error occurred, and that's exactly the trap; the problem isn't that the statement is false, it's that its truth is worthless as evidence until the antecedent has fired at least once. And here is the caveat that keeps you from burning out: not every unfired guarantee is broken. Some antecedents should almost never fire; that's the entire point of defense in depth. Unfired does not mean broken. It means unverified. You simply don't get to count a guarantee as protection until you've triggered it, once, under controlled conditions, and watched the consequent actually hold.
So here is the move, and it's small enough to start this week. Walk your guarantees: the error handlers, the retries, the fallbacks, the failovers, the “100% of flagged threats remediated” dashboard that reads 100% because nothing was ever flagged. For each one, rewrite it in your head as a plain implication: if [bad thing], then [we cope]. Then ask the one question that actually separates protection from theater, not “did the test pass,” because passing is the cheap half of the truth table, but: has [bad thing] ever actually happened in a test? Did a real error hit that catch block? Was the dependency genuinely down when the fallback ran? Was a threat actually flagged before the dashboard claimed it was handled? If you cannot point to a single run where the antecedent fired, that guarantee is vacuously true, which is to say untested, which is to say a hypothesis with excellent PR.
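The dashboard case is easy to reproduce, and just as easy to fix. In this sketch (all names are illustrative) the naive metric reports 100% when nothing was ever flagged, while the report beside it counts how often the antecedent fired and refuses to call an unfired guarantee verified.

```python
from dataclasses import dataclass


def naive_percent(fired: int, held: int) -> float:
    """The dashboard's arithmetic: zero of zero is reported as a perfect score."""
    return 100.0 if fired == 0 else 100.0 * held / fired


@dataclass
class GuaranteeReport:
    name: str
    fired: int  # times the antecedent ("bad thing") actually occurred
    held: int   # times the consequent ("we coped") held when it did

    @property
    def status(self) -> str:
        if self.fired == 0:
            return "UNVERIFIED"  # vacuously true: no evidence either way
        return "VERIFIED" if self.held == self.fired else "BROKEN"


reports = [
    GuaranteeReport("flagged threats remediated", fired=0, held=0),
    GuaranteeReport("fallback serves when primary is down", fired=3, held=3),
    GuaranteeReport("retry recovers transient failures", fired=5, held=4),
]

for r in reports:
    print(f"{r.name}: {naive_percent(r.fired, r.held):.0f}% -> {r.status}")
```

The first two guarantees both read 100% on the naive metric. Only the status column tells you which one has ever been tried.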
Find the antecedents you have never fired, and fire one. You will learn something the first time, every time, and it is a far better thing to learn it from a chaos test on a quiet Tuesday than from an rm -rf at midnight, staring at five backups that were, every single one of them, technically true.
An agent's “I handled it” is vacuously true until you can see the failure path actually fired.
The essay's last antecedent is the live one: an agent making dozens of tool calls in a chain will hit failures, timeouts, and malformed responses in production, and every “it recovers” guarantee is vacuous until one of those actually fired and you watched the consequent hold. You cannot confirm that from the agent's own summary, because a guarantee and the report of it are written by the same process. Chain of Consciousness anchors every agent action, including the failure-path branches, to a verifiable external record, so “the antecedent fired and we recovered” is a checkable trace instead of a green checkmark you took on faith.
pip install chain-of-consciousness · npm install chain-of-consciousness