The Gettier Problem: Your Green Test Might Be Green by Luck, and Luck Is Not Knowing

An async test that forgot to await its assertion passes 4,000 CI runs while checking nothing. Green, and the system happens to work: you have a justified true belief that's correct by luck. Gettier proved in three pages that luck is not knowing. Coverage is justification, not knowledge.

Published June 2026 · 12 min read

There is a particular kind of test that haunts JavaScript codebases, and it goes like this. Someone writes an asynchronous test: it calls a function that returns a promise, and inside the promise's callback sits the assertion that actually checks the answer. But they forget to return the promise or await it. So the test runner kicks off the async work, reaches the end of the function without waiting, sees that nothing has thrown, and marks the test green. The assertion inside the promise does eventually run, a few milliseconds later, long after the runner has already moved on, where its result is silently discarded. The test passes. It passed yesterday. It has passed on every one of the last four thousand CI runs. And in all that time it has checked exactly nothing.

The frightening part isn't that the test is broken. The frightening part is that the system it tests might genuinely work, the feature ships, customers use it, production is fine. So the test is green, and the system does work, and you had every reason to believe it would. You have a justified, true belief that your code is correct. And you do not know that it is. You got lucky.

This exact situation, a belief that is justified, and true, and still somehow not knowledge, broke a definition that had stood for roughly twenty-three centuries. It took three pages.

The three pages that broke "justified true belief"

Since Plato's Theaetetus, written around 369 BCE, the working definition of knowledge in Western philosophy was justified true belief. To know something, say, that it's three o'clock, three things had to hold. You had to believe it (you think it's three). It had to be true (it is three). And you had to be justified (you have a good reason, you looked at a clock). Belief, truth, justification. Three boxes. Check all three and you had knowledge. The Theaetetus itself pokes at this and ends without fully settling it, but the JTB formula became the default for over two thousand years.

In June 1963, a thirty-five-year-old philosopher named Edmund Gettier, under some pressure to publish, wrote a paper called "Is Justified True Belief Knowledge?" It ran three pages in the journal Analysis. Gettier published almost nothing else of note in his career, and he didn't need to: those three pages are among the most famous in modern philosophy, because they showed that you can check all three boxes and still, obviously, not know.

His method was to build cases where the justification reaches the truth by accident. The cleanest version isn't even Gettier's own, it's older, from Bertrand Russell in 1948. You glance at a clock on the wall. It reads three o'clock. You believe, reasonably, that it's three o'clock, and it is three o'clock. Justified (clocks are reasonable things to trust), true (it really is three), believed. But the clock stopped, exactly twelve hours ago, and you happened to look at the one instant in the day when a stopped clock is right. Do you know it's three o'clock? Plainly not. You'd have believed it was three at 3:00, 3:01, 3:15, at any time, because the hands never moved. Your justification, "I looked at a clock", had no connection to the truth it landed on. It hit the truth the way a blind dart hits a bullseye. A stopped clock is right twice a day, and at those two moments your justified true belief is not knowledge. It's a coincidence wearing knowledge's clothes.

That is the Gettier problem, and once you have seen it you cannot unsee it in your test suite.

Your green test is a Gettier machine

Walk the three boxes for a passing test. You believe the system works, you shipped it. It is true that the system works, it's running fine in production. And you are justified, the tests are green, the coverage is high, the pipeline is a wall of checkmarks. Belief, truth, justification. Textbook JTB.

But Gettier's question is not "do you have all three?" It is "is your belief true because your justification tracked the truth, or true by a luck your justification can't tell apart from knowledge?" And for a distressing amount of what we call testing, it is the second one.

The async test from the opening is the stopped clock exactly: green every run, the system happens to work, and the green is connected to the working by nothing at all. But it has subtler cousins, and you have shipped all of them:

The test that asserts on a mock that you configured to return the expected value, so it stays green even if you delete the real function it's supposed to be testing. The justification is a thing you told to agree with you.
The test that checks assert result is not None, green whether the logic is perfect or catastrophically wrong, because something came back.
The snapshot test whose snapshot was regenerated from the buggy output, so the bug is now enshrined as the correct answer, and the test will go red only if you accidentally fix it.
The fixture coincidence, the quietest of all: two wrong calculations that happen to cancel for the one set of test data, so the math is broken and the test is green and the two facts have never been introduced to each other.

Every one of these is a justified true belief that the system works. Every one is green. And in every one, the green is the stopped clock, right, for now, but not because it tracked the truth. This is the thing to internalize: coverage is justification, and justification is not knowledge. A coverage report tells you a line of code executed while a test ran. That is the "I looked at a clock" of software. It does not tell you the test would have gone red if that line had been wrong, and that, the going-red-when-it-breaks, is the only thing that separates knowing your system works from having gotten away with it.

Philosophy already wrote the fix, twice, and one of the times it was code

Here is the genuinely useful part, the reason this is more than a clever frame. For sixty years after Gettier, philosophers tried to repair the definition by adding a fourth condition, some requirement that rules out the lucky cases. And their fixes are, almost word for word, the criteria for a test worth trusting.

The first attempt, from Michael Clark in 1963 (the same year, that's how fast the field moved), was no false lemmas: your belief counts as knowledge only if you didn't reach it through any false intermediate step. The testing translation is immediate: audit what your green actually depends on. If the test passes because a mock lied to it, or because a fixture coincided, the conclusion "the system works" rests on a false premise, and it's the stopped clock no matter how green it glows.

The second, from Alvin Goldman in 1967, was causal and then reliabilist: a belief is knowledge when it's produced by a reliable process that's actually causally connected to the fact it's about. For tests: the green must be caused by the real behavior running, through a reliable path, not produced by a stand-in, not by a coincidence. If the thing that made the test pass is anything other than the real code doing the real thing correctly, the causal chain is broken and you have justification without knowledge.

And then the third, which is the one to tattoo on the inside of your eyelids. In his 1981 book Philosophical Explanations, Robert Nozick proposed replacing "justification" with truth-tracking, and its core was a condition he called sensitivity, stated as a subjunctive conditional: if P were false, you would not believe P. You know it's three o'clock only if, were it not three o'clock, you would not think it was. The stopped clock fails this instantly: if it weren't three, you'd still believe it was three, because the hands aren't moving. Truth-tracking means your belief moves with the truth: false, and you'd notice; true, and you'd catch it.

Read that condition as an engineer and you will feel the floor shift, because Nozick wrote the specification for mutation testing eighteen years before most teams had heard of it. "If the behavior were broken, the test would go red" is precisely sensitivity. Mutation testing operationalizes it directly: you take your real code, deliberately break it in small ways, flip a > to >=, delete a line, invert a boolean, swap a + for a -, and you ask whether any test goes red. Each break is a "mutant"; a test that catches it "kills" it; your mutation score is the fraction of injected faults your suite detects. A suite with 100% coverage and a 0% mutation score is the purest stopped clock there is: it executes every line (maximal justification) and would notice if none of them broke (zero tracking). The coverage says you looked at the clock. The mutation score says whether the clock is running.

What makes this more than a tidy analogy is that the engineers got there independently and almost simultaneously. The foundational mutation-testing paper, DeMillo, Lipton, and Sayward's "Hints on Test Data Selection", appeared in 1978, three years before Nozick's book, and neither side had any idea the other existed. A philosopher chasing the definition of knowledge and three computer scientists chasing better test data converged, within three years, on the identical criterion: you only really know it if your check would have caught the lie. When a question gets answered the same way twice by people who never spoke, that's usually a sign the answer is real.

Post-Gettier fix	What it demands	Testing translation
Clark, 1963: no false lemmas	No false intermediate step	Audit what the green depends on: no lying mock, no coincident fixture
Goldman, 1967: causal / reliabilist	Caused by the fact, via a reliable process	The pass must be caused by the real behavior running, not a stand-in
Nozick, 1981: sensitivity	If P were false, you would not believe P	Mutation testing: would the test go red if the behavior broke?

Fake-barn country, and the suite that nearly fooled you

There's one more Gettier case worth carrying, because it catches a failure mode the others miss. Picture driving through the countryside, looking at what is unmistakably a barn, forming the true and justified belief "that's a barn." It really is a barn. But unknown to you, you've wandered into a stretch of road where, for the sake of a film set, all the other barn-shaped things are mere facades, flat painted fronts propped up with two-by-fours. You happened to look at the one real barn. Your belief is true and justified. But you so easily could have been fooled, turn your head ten degrees and you'd have said "barn" at a painted board with equal confidence, that we balk at calling it knowledge. The philosophers Alvin Goldman and Carl Ginet built this one in 1976, and it names a danger the mocks-and-fixtures crowd lives in daily.

Because what is a mock-everything test suite if not fake-barn country? You have one test that's genuinely correct, it really exercises the real behavior and really checks it. But it sits in a landscape of two hundred tests that assert on facades: stubbed responses, faked clients, fixtures painted to look like data. The one good test is right. But you are one careless mock away, at every moment, from believing a painted board is a barn, and you have no reliable way, from inside the green, to tell the real one from the props. The environment of fakes doesn't just hide bad tests; it makes even your good test's correctness feel like luck, because the landscape was built to fool you and mostly does.

The only question worth asking

So here is the discipline, and it fits on an index card.

Stop asking "is it green?" That question, the one every dashboard is built to answer, is the JTB question, and Gettier proved sixty years ago that it's the wrong one. It tells you that you believe, and that you're justified, and you can confirm the truth by checking production. All three boxes. None of them is knowledge.

Ask instead: is it green because the system works, or green by a luck my justification can't distinguish from knowledge? And then make that question answerable, with the fixes the philosophers handed you. Audit what the green depends on: a green that rests on a lying mock is a false lemma (Clark). Require that the pass be caused by the real behavior, not a stand-in (Goldman). And above all, check sensitivity (Nozick): would this test go red if the behavior broke? Run mutation testing if you can, and read a low mutation score as exactly what it is: a wall of stopped clocks.

If you can't run a mutation-testing tool, there is a poor man's version of the same discipline that costs nothing and that the best engineers already practice without naming it: watch every test fail before you trust it to pass. The entire reason test-driven development insists you see the test go red before you write the code to make it green is that the red is your sensitivity check, done by hand: proof that this test can fail, and fails for the right reason, before you ever let its green mean anything. A test that was born green, written after the code and never once observed in a failing state, is a clock you have never confirmed is ticking. It may be telling the right time. You have no way to know it isn't simply stopped.

Edmund Gettier needed three pages to show that being right is not the same as knowing. Your CI pipeline is a wall of green checkmarks insisting otherwise. The next time it tempts you to ship on the strength of all that justification, remember the stopped clock on the wall, reading three o'clock at exactly three o'clock, and ask the only question that has ever separated knowledge from luck: not is it right, but would it have gone wrong if the truth had.

When an agent reports "verified," is the green sensitive to the truth, or a thing it told to agree with itself?

An agent that says "I checked, it passes" is making a justified true belief claim, and from the outside you cannot tell a green that tracked the behavior from a green by luck. The difference has to be recorded. Chain of Consciousness keeps that record: a tamper-evident trace of what an agent actually exercised and what would have made it report failure, so "verified" comes with the sensitivity behind it instead of standing in for it.

pip install chain-of-consciousness · npm install chain-of-consciousness
Hosted Chain of Consciousness → · See it in action

← Back to all posts