
Mark-Recapture for "How Many Bugs Are Left?"

Numismatic coverage estimators for test and fuzzing completeness — one equation, four domains, from Bletchley Park to your fuzzer.

Published June 2026 · 10 min read

A cataloguer at the American Numismatic Society is holding a Roman silver denarius, two thousand years old, and asking a question that has no obvious answer: how many of these were there originally? Not how many survive — she can count those — but how many dies the ancient mint cut to strike them, including the dies from which not a single coin has lived to the present day. On the same imaginary Tuesday, a security engineer running a fuzzer against a compiler is staring at a dashboard that has gone quiet. No new crashes for three days. And she is asking the same question, in a different dialect: how many bugs are left that I haven't found?

These two people would not recognize each other's vocabulary. One says "die study"; the other says "coverage-guided fuzzing." But they are solving the same equation — literally the same formula, with the same variables in the same places — and that formula traces back to a codebreaker at Bletchley Park. The story of how one piece of arithmetic came to answer "how many am I missing?" across war, ecology, archaeology, and cybersecurity is worth telling, because the punchline is a practical tool most software teams are not using and should be.

The die-counter's dilemma

Start with the coins, because the ancient world makes the problem vivid. A Roman mint struck coins by hand. An engraver cut a design into a hardened metal die; a worker placed a blank disc of silver on it, set a second die on top, and hit it with a hammer. Dies wore out. Some shattered after a few hundred strikes; others lasted for thousands before the portrait blurred into mush. When a die failed, it was replaced, and the new one differed in a hundred tiny ways an expert can read — a slightly different tilt to the emperor's nose, a stray die-crack, the spacing of the legend.

So a numismatist can take a tray of surviving denarii and sort them by die: these eleven coins all came from the same die; that one is alone; those three share another. This is called die-linking. And it sets up the genuinely hard question. The dies that left many surviving coins are easy — you will obviously notice a die represented eleven times. But what about the dies that produced coins of which none survived? They left no trace in your tray at all. How do you count the things that, by definition, you cannot see?

Warren Esty, a mathematician, gave numismatics a real answer in a series of papers running from 1986 into the 2010s. His central tool is a coverage estimator, and its beauty is that it leans entirely on the coins you can see — specifically, on the loneliest ones. Define f1 as the number of dies represented by exactly one surviving coin: the singletons. Let n be the total number of coins you have examined. Then the estimated coverage — the probability that the next coin you pick up will come from a die you have already seen — is

C = 1 − (f1 / n)

The intuition is almost unfairly simple. If you keep turning up coins from dies you have never seen before (lots of singletons), you are plainly still discovering the population, and your coverage is low. If almost every new coin is a re-sighting of a die you already know, the singletons thin out, the ratio f1/n drops, and your coverage climbs toward one. The singletons are the population's way of telling you how much of it remains in the dark.

From coverage you get the headline number. If d is the count of distinct dies you have actually observed, the estimated total (observed plus invisible) is D̂ = d / C. Observe 50 distinct dies at a coverage of 0.80 and you estimate roughly 62.5 dies originally existed (50 / 0.80), which means about 12.5 dies struck coins that the centuries swallowed whole. You have just counted the uncountable from nothing but the shape of your sample.
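As a rough illustration of the two formulas above, the Python sketch below computes C and D̂ from a list of die labels, one label per coin. The function name `coverage_estimate` and the toy tray (eleven coins from one die, three from another, one singleton, echoing the sorting example earlier) are assumptions made for illustration, not Esty's code or data.

```python
from collections import Counter


def coverage_estimate(die_labels):
    """Estimate sample coverage C = 1 - f1/n and total dies D = d / C.

    die_labels: one entry per coin, naming the die that struck it.
    """
    n = len(die_labels)                      # total coins examined
    if n == 0:
        raise ValueError("need at least one coin")
    counts = Counter(die_labels)             # coins per observed die
    d = len(counts)                          # distinct dies observed
    f1 = sum(1 for c in counts.values() if c == 1)  # singleton dies
    coverage = 1 - f1 / n
    # If every die is a singleton, coverage is 0 and the total is unbounded.
    total = d / coverage if coverage > 0 else float("inf")
    return {"n": n, "d": d, "f1": f1, "coverage": coverage,
            "estimated_total": total}


# Hypothetical tray: 11 coins from die A, 3 from die B, 1 from die C.
tray = ["A"] * 11 + ["B"] * 3 + ["C"]
result = coverage_estimate(tray)
print(result)
```

On a tray this small the estimate is only a toy; the point is that the inputs are nothing more than the per-die counts.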

The part that should make a skeptic sit up is that this has been checked against ground truth — a rarity in statistics, where you usually estimate precisely because you can't know the real answer. Richard Schaefer's Roman Republican Die Project has assembled die-link data on something on the order of 300,000 specimens. Crucially, some Roman Republican issues carry sequential control marks — letters and numerals deliberately cut into the dies in order — so for those issues the actual number of dies is known independently. Esty's estimators can be run on the surviving coins and graded against the true count. They hold up. The math is not a hopeful metaphor; it is a measuring instrument with a calibration certificate.

The formula's ancestry

Here the story doubles back on itself in a way that feels invented. Esty did not derive his estimator from coins. He borrowed it from I.J. Good, who published it in Biometrika in 1953 as a way of estimating "the population frequencies of species" — how many kinds of animal exist in a forest, given a finite catch. And Good did not invent it in peacetime academia either. He developed the core idea during the Second World War at Bletchley Park, working alongside Alan Turing on the problem of decrypting German Enigma traffic.

The codebreakers' version of "how many species are unseen?" was "how many letter combinations or settings have we not yet observed, and how should that reshape our probability estimates?" Turing and Good needed to reason about the frequency of things that had not yet appeared in their intercepts, because those unseen possibilities governed the odds on every decryption. The technique that fell out of that work — assigning probability mass to the unobserved based on how many things you have seen exactly once — is now known as the Good-Turing frequency estimator. A lovely paper by T.V. Buttrey and colleagues, with the irresistible title "A tale of buried treasure, some good estimations, and golden unicorns," traces exactly this lineage and Turing's own incidental connections to numismatics.

So the arithmetic the American Numismatic Society cataloguer uses to count vanished Roman dies is, genealogically, the same arithmetic that helped shorten a world war. The singletons that signal "unseen species" in a forest, "unseen dies" in a hoard, and "unseen Enigma settings" in a day's intercepts are one idea wearing four coats.

The bridge into software

The fourth coat is software, and it has been there, quietly, for more than fifty years.

The simplest crossing is the Lincoln-Petersen estimator, which ecologists use to count fish. Net some fish, mark them, throw them back. Net a second batch later and count how many already carry a mark. If your second net is full of marked fish, the pond is small; if you rarely re-catch a marked one, the pond is large. Formally, with N₁ caught and marked, N₂ caught the second time, and M of those recaptured, the population is about N̂ = N₁ × N₂ / M.

Now read that as a bug hunt, exactly as the mathematician John D. Cook laid out in a 2010 blog post. Two testers comb the same software independently. Tester A files 30 bugs; Tester B files 40; 20 bugs appear on both lists. That "recapture" overlap of 20 lets you estimate the total: 30 × 40 / 20 = 60 bugs in the code. Between them the two testers found 50 distinct bugs (30 + 40 − 20), so the estimate says roughly 10 are still hiding. Two people, a little overlap, and suddenly "we found a bunch of bugs" becomes "we found about 83% of them."
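The two-tester arithmetic can be sketched in a few lines of Python. The function name `lincoln_petersen` and the integer bug IDs are placeholders chosen to reproduce the 30 / 40 / 20 example; in practice the hard part is deciding when two reports describe the same bug.

```python
def lincoln_petersen(first, second):
    """Two-sample Lincoln-Petersen estimate from two collections of bug IDs."""
    first, second = set(first), set(second)
    recaptured = len(first & second)         # bugs found by both testers
    if recaptured == 0:
        raise ValueError("no overlap: the estimate is undefined")
    estimated_total = len(first) * len(second) / recaptured
    found = len(first | second)              # distinct bugs found so far
    return estimated_total, found, estimated_total - found


# Hypothetical IDs: tester A files bugs 0..29, tester B files bugs 10..49.
tester_a = range(0, 30)
tester_b = range(10, 50)
total, found, remaining = lincoln_petersen(tester_a, tester_b)
print(f"estimated total {total:.0f}, found {found}, "
      f"still hiding {remaining:.0f}, share found {found / total:.0%}")
```

With zero overlap the function refuses to answer rather than divide by zero, which is the honest response: no recaptures means the sample says nothing about the total.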

This is not a 2010 novelty. Harlan Mills, at IBM, proposed the software version in 1972, and he made the ecology analogy literal rather than figurative. His method, defect seeding, is intentional mark-recapture: before testing, deliberately inject a known number of artificial bugs into the code — "mark" them and release them into the wild — then see what fraction the testing process "recaptures." Find 15 of your 20 seeded bugs and 75 real ones, and you estimate 75 × 20/15 = 100 real bugs total, of which 25 remain. Mills was, in effect, tagging fish made of code.
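The seeding arithmetic is just as short. This is a minimal sketch under the assumption that seeded and real defects are equally easy to find; the name `mills_seeding` and its parameters are illustrative, not taken from Mills' paper.

```python
def mills_seeding(seeded, seeded_found, real_found):
    """Defect-seeding estimate of real defects (total and still unfound).

    Assumes seeded and real defects are caught at the same rate.
    """
    if not 0 < seeded_found <= seeded:
        raise ValueError("need 0 < seeded_found <= seeded")
    recapture_rate = seeded_found / seeded       # share of marked bugs recaught
    estimated_real = real_found / recapture_rate
    return estimated_real, estimated_real - real_found


# The numbers from the text: 15 of 20 seeded bugs found, plus 75 real ones.
estimated_real, remaining_real = mills_seeding(20, 15, 75)
print(estimated_real, remaining_real)
```

If the seeded bugs are easier to find than real ones, the recapture rate is too flattering and the estimate of what remains comes out too low.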

For more than two testers, the simple Lincoln-Petersen gives way to richer models — Chao's estimator, jackknife estimators — that handle the higher-order overlaps among many independent observers. These were carried into industrial practice and studied carefully: capture-recapture was applied to code inspections at IBM and evaluated systematically in the software-engineering literature around the turn of the millennium. The ecology of bugs had a working theory before most of today's engineers were born.
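For a feel of what the richer models look like, here is a sketch of two common incidence-based forms: the bias-corrected Chao2 estimator and the first-order jackknife. Which variants the inspection studies actually used is not stated here, so treat the choice of formulas, the name `richness_estimates`, and the three made-up inspectors as assumptions.

```python
from collections import Counter


def richness_estimates(findings_by_inspector):
    """Chao2 (bias-corrected) and first-order jackknife richness estimates.

    findings_by_inspector: list of collections of defect IDs, one per inspector.
    """
    m = len(findings_by_inspector)           # number of inspectors
    if m < 2:
        raise ValueError("need at least two inspectors")
    # For each defect, count how many inspectors reported it.
    incidence = Counter(
        defect for found in findings_by_inspector for defect in set(found)
    )
    s_obs = len(incidence)                                   # distinct defects
    q1 = sum(1 for k in incidence.values() if k == 1)        # seen by one
    q2 = sum(1 for k in incidence.values() if k == 2)        # seen by two
    chao2 = s_obs + ((m - 1) / m) * q1 * (q1 - 1) / (2 * (q2 + 1))
    jackknife1 = s_obs + q1 * (m - 1) / m
    return {"observed": s_obs, "chao2": chao2, "jackknife1": jackknife1}


# Hypothetical inspection round with three inspectors.
inspectors = [{1, 2, 3, 4, 5}, {1, 2, 3, 6}, {1, 2, 7, 8}]
estimates = richness_estimates(inspectors)
print(estimates)
```

Both estimators lean on the same signal as the coverage formula: defects reported by only one inspector push the estimated total up.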

The fuzzer's dilemma

Which brings us back to the quiet dashboard. Modern bug-hunting is increasingly done by fuzzers: programs that hammer other programs with mountains of generated input, watching for crashes. And fuzzers run face-first into the die-counter's problem.

John Regehr documented the experience with unusual honesty while fuzzing the Solidity compiler. His team found a flurry of bugs in early February 2020, then went through multi-day stretches finding nothing. They added new mutators on February 21 and the bugs resumed — then the well went dry again for over a month, despite, in his words, "over 1 billion compilations." The quantitative lesson he drew is the one every test owner should tape to the wall: finding linearly more instances of a known set of bugs costs linearly more compute; finding linearly more distinct bugs costs exponentially more. A quiet fuzzer is not necessarily a clean program. It may simply be a fuzzer that has entered the expensive part of the curve.

Marcel Böhme made the connection rigorous in 2018 with a framework he called STADS — Software Testing as Species Discovery. The mapping is precise: each distinct program behavior (a branch, a crash class, a vulnerability type) is a species; each test input is a sampling event; and the same ecological richness estimators — Good-Turing, Chao — that count unseen beetles now estimate the total number of feasible program behaviors, the additional time needed to reach more of them, and the residual risk that an undiscovered vulnerability is lurking. The framing came with a sentence that ought to be read aloud in every security review: "Failing to discover a vulnerability does not mean that none exists — even if the fuzzer was run for a week (or a year)." STADS was demonstrated on AFL, a state-of-the-art fuzzer, and it turns a fuzzing campaign from a gamble into a measurement.

The follow-up is more sobering still. In 2023, Böhme and colleagues studied how much of the reachable code fuzzers actually cover over time and found no evidence that fuzzers reach an asymptote. Plot coverage against the logarithm of time and you get something close to a straight line, climbing, for most programs — no ceiling in sight. The practical reading is stark: a fuzzer's "coverage percentage" at any moment is a point on an endlessly rising curve, not a finish line. "We ran it for a week and found nothing new" is not evidence of completeness. It is evidence of exponential cost — the curve flattening in wall-clock time precisely because each new species now costs so much to discover.
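To see what a straight line on a log-time axis implies for cost, the sketch below fits branches covered against ln(hours) by ordinary least squares and asks how long a given target would take. The readings are invented to lie exactly on such a line; they are not data from the study, and the function names are placeholders.

```python
import math


def fit_log_linear(hours, branches_covered):
    """Least-squares fit of branches_covered = intercept + slope * ln(hours)."""
    xs = [math.log(t) for t in hours]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(branches_covered) / len(branches_covered)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(xs, branches_covered))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def hours_to_reach(target, slope, intercept):
    """Invert the fitted line: hours needed to cover `target` branches."""
    return math.exp((target - intercept) / slope)


# Invented readings: +100 branches every time the campaign length doubles.
hours = [1, 2, 4, 8, 16]
branches = [1000, 1100, 1200, 1300, 1400]
slope, intercept = fit_log_linear(hours, branches)
for target in (1500, 1600, 1700):
    print(target, round(hours_to_reach(target, slope, intercept), 1))
```

Under this toy fit every further fixed step in coverage doubles the time already spent, which is the exponential cost the quiet dashboard is hiding. Extrapolating a real campaign this way assumes the line keeps its slope.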

One equation, four domains

Lay the four domains side by side and the isomorphism stops being cute and starts being useful:

| Ecology | Numismatics | Software testing | Fuzzing |
| --- | --- | --- | --- |
| Species | Die | Bug class / failure mode | Crash type / branch |
| Individual animal | Coin specimen | Individual test failure | Individual input |
| Trap session | Museum survey batch | Inspection round | Fuzzing campaign |
| Species seen once | Die with one surviving coin | Bug found by one tester | Crash from one seed |
| Coverage = 1 − f1/n | Coverage = 1 − f1/n | Coverage = 1 − f1/n | Coverage = 1 − f1/n |
| Lincoln-Petersen | Esty's estimator | Mills' seeding | STADS |

The formula in the coverage row does not change as you read across. Good-Turing does not care whether your singletons are beetles, denarius dies, or buffer-overflow variants. Count the things you have seen exactly once, divide by your total sample, subtract from one, and you have an honest estimate of how much of the population you are looking at, in any field where "how many am I missing?" is the real question.

There is one trap worth naming, because it bites in the direction that hurts. Capture-recapture assumes the captures are independent. In ecology, traps that animals learn to love ("trap-happy" bias) wreck the estimate. In software, the equivalent is two testers who share a methodology, or two fuzzers seeded from the same corpus and grammar: their overlap is artificially inflated, which drives the estimated total down. That is the dangerous direction — it tells you that you are nearly done when you are not. The numismatic mirror is a die study built from a single buried hoard: coins struck near each other in time, non-independent samples, biasing the die count downward. Whatever domain you are in, the estimator is only as good as the independence of your looks at the population. Diversify your testers, your seeds, your strategies — not for thoroughness as a virtue, but because correlated searches lie to you about how much is left.
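A small expected-value calculation shows the direction of the bias. Assume 100 real bugs and two testers with identical habits: in one scenario every bug is equally likely to be found, in the other half the bugs are easy for both testers and half are hard for both. The probabilities and the name `expected_lincoln_petersen` are made up for illustration.

```python
def expected_lincoln_petersen(groups):
    """Lincoln-Petersen estimate computed from expected counts.

    groups: list of (bug_count, detect_prob); both testers share each
    group's detection probability, so they share the same blind spots.
    """
    caught_a = sum(count * p for count, p in groups)     # expected finds, A
    caught_b = caught_a                                  # B has the same habits
    overlap = sum(count * p * p for count, p in groups)  # expected found by both
    return caught_a * caught_b / overlap


# Same 100 bugs, same 50 expected finds per tester in both scenarios.
uniform = expected_lincoln_petersen([(100, 0.5)])
shared_blind_spots = expected_lincoln_petersen([(50, 0.8), (50, 0.2)])
print(round(uniform, 1), round(shared_blind_spots, 1))
```

Each tester finds the same number of bugs in both scenarios; only the overlap changes. The shared blind spots inflate it, and the estimate falls well short of the true 100.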

The practical part

Here is the insight you can use on Monday. The most dangerous sentence in software is "all tests pass." A green suite with zero failures carries zero information about undiscovered bugs — it is a wildlife survey that caught nothing and concluded the forest is empty. Pass/fail is the wrong output. Coverage is the right one.

So instrument for it. In any campaign where you can attribute findings to distinct sources — multiple testers, multiple fuzzer seeds, multiple analysis tools — track two numbers: the count of failure modes found by exactly one source (f1, your singletons) and the total observations (n). The ratio f1/n is your live readout of how much remains unseen, and 1 − f1/n is your defensible coverage figure. When the singletons stay stubbornly high, you are nowhere near the bottom of the barrel no matter how many tests are green. When they fade, you are approaching diminishing returns — and now you can say so with a number instead of a shrug. If you want the stronger version, do what Mills did in 1972: seed a known set of synthetic defects and measure your recapture rate directly.
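One way to wire this into a campaign is sketched below, assuming findings arrive as (source, failure mode) pairs that have already been deduplicated into failure modes. The function name, the source labels, and the counting rule (each source counts at most once per failure mode) are assumptions for illustration.

```python
from collections import defaultdict


def singleton_readout(observations):
    """Return f1, n, unseen share f1/n and coverage 1 - f1/n.

    observations: iterable of (source, failure_mode) pairs. A source
    reporting the same failure mode repeatedly is counted once.
    """
    sources_by_mode = defaultdict(set)
    for source, mode in observations:
        sources_by_mode[mode].add(source)
    n = sum(len(s) for s in sources_by_mode.values())   # total observations
    if n == 0:
        raise ValueError("no observations: coverage is unmeasured, not 100%")
    f1 = sum(1 for s in sources_by_mode.values() if len(s) == 1)  # singletons
    return {"f1": f1, "n": n, "unseen": f1 / n, "coverage": 1 - f1 / n}


# Hypothetical campaign: two fuzzers and one human tester.
findings = [
    ("fuzzer-a", "heap-overflow"), ("fuzzer-b", "heap-overflow"),
    ("tester-1", "heap-overflow"), ("fuzzer-a", "null-deref"),
    ("tester-1", "null-deref"), ("fuzzer-b", "int-overflow"),
    ("tester-1", "race"),
]
readout = singleton_readout(findings)
print(readout)
```

An empty findings list raises an error instead of reporting full coverage, which mirrors the point above: a run that caught nothing has measured nothing.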

A Roman coin cataloguer and a Project Zero engineer are, against all reasonable expectation, running the same calculation that a Bletchley Park codebreaker used to count the unseen. The least you can do, the next time someone reports that the suite is green, is ask the question all of them would: and how many are we still missing? Then reach for the formula that can actually answer it.

A clean record is a survey that caught nothing — not proof the forest is empty.

The same lesson governs trusting an agent: "no failures observed" is coverage you haven't measured, not safety you've earned — and a single signal can't tell you what you're missing. The Agent Trust Stack is the multi-source answer: independent, diverse looks at an agent — signed provenance for what it actually did, portable reputation for how it has behaved over many counterparties, verifiable identity underneath — so your estimate of residual risk comes from uncorrelated observations instead of one green checkmark.

pip install agent-trust-stack · npm install agent-trust-stack
vibeagentmaking.com → · See the stack in action