
“Done” Is Not a State

A recovery system detected stalled tasks and requeued them. Then it detected them again. 3,800 duplicates later, the dashboard still showed 100% success.

Published April 2026 · 9 min read

On December 16, 2024, a developer filed a bug report against Trigger.dev, the open-source background job framework. A routine nightly server restart had caused random tasks to get stuck in a “queued” state. The system’s recovery logic, working exactly as designed, detected the stalled tasks and requeued them. Then it detected them again. And again. By the time anyone checked the dashboard, 3,800 duplicate tasks were sitting in the queue, each one a faithful copy of work that had already been completed.

The monitoring system showed no errors. Every task had succeeded. The duplicates were executing successfully too. From the system’s perspective, nothing was wrong.

This is the kind of bug that makes senior engineers go quiet. Not because it’s complicated — the explanation fits in a sentence — but because the implications are uncomfortable. The system didn’t malfunction. It did exactly what it was designed to do: detect abandoned work and retry it. The problem is that “completed successfully” and “abandoned silently” produce the same signal from the outside. Both go quiet.


Two Generals and No Good Options

The theoretical foundation for this is older than most production systems running today. In 1985, Fischer, Lynch, and Paterson published their impossibility result in the Journal of the ACM: in an asynchronous system, no deterministic protocol can guarantee that processes agree on a decision if even a single process may fail. The paper is formally about consensus, but its practical implication is about something more mundane. It’s about acknowledgments.

You send a message. Did it arrive? You wait for an acknowledgment. Did the acknowledgment arrive? You could send an acknowledgment of the acknowledgment, but that just moves the problem one level up. This is the Two Generals Problem, and it has no solution. Not “no known solution” — no solution, period. It is a mathematical impossibility, as fundamental to distributed computing as the halting problem is to computation itself.

Tyler Treat crystallized the practical consequence in a 2015 essay that has since become something of a canonical reference: “You Cannot Have Exactly-Once Delivery.” There are, he argued, exactly two real delivery semantics. At-most-once: acknowledge the message before processing it, accept that crashes will lose data. At-least-once: acknowledge after processing, accept that retries will duplicate work. Everything else is one of these two, wearing a better outfit.
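
In code, the whole distinction is the position of one acknowledgment. A minimal sketch, assuming a hypothetical queue client whose receive and ack methods stand in for whatever your broker actually exposes:

```python
# The two real delivery semantics. `queue.receive()` and `queue.ack()`
# are illustrative names, not any specific library's API.

def consume_at_most_once(queue, process):
    msg = queue.receive()
    queue.ack(msg)   # acknowledged first: a crash inside process() loses the work
    process(msg)

def consume_at_least_once(queue, process):
    msg = queue.receive()
    process(msg)
    queue.ack(msg)   # acknowledged last: a crash before this line means
                     # redelivery, and the work runs twice
```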

Exactly-once delivery, Treat argued, is achieved in practice “by faking it” — through idempotent operations, deduplication layers, or application-level state machines that make repeated processing safe even when the underlying transport cannot make it impossible. Apache ZooKeeper’s Zab protocol demonstrates the approach: state changes are idempotent, so applying the same change multiple times produces no inconsistencies. But this is an application-level guarantee, not a network-level one. The network still delivers messages more than once. The application just learned not to care.

The theory says duplicates are inevitable. The question isn’t whether your system will duplicate work. It’s whether it will notice.


The Industry Said This Out Loud

Here is the part that makes the Trigger.dev incident less surprising and more damning. The largest cloud providers in the world don’t just acknowledge duplicate execution. They document it as expected behavior.

Google Cloud Tasks states it plainly: “In situations where a design trade-off must be made between guaranteed execution and duplicate execution, the service errs on the side of guaranteed execution.” Their published metric: more than 99.999% of tasks are executed only once. Five nines of uniqueness sounds impeccable until you do the arithmetic. At one million tasks per day — a modest load for any serious deployment — 99.999% means ten duplicates daily. Three thousand six hundred and fifty per year. Whether that number is acceptable depends entirely on whether each task is counting page views or charging credit cards.

AWS is equally explicit. Standard SQS queues guarantee “at-least-once” delivery, and the documentation enumerates three specific scenarios in which Lambda functions will be invoked more than once for the same message: the Lambda service fails to delete the message from SQS before the visibility timeout expires; the Lambda service sends the event but fails to receive acknowledgment; an intermittent issue causes SQS to return the same message on a subsequent poll. The documented mitigation is to store message IDs in DynamoDB and check before processing. But this adds latency, cost, and its own failure modes. What if the DynamoDB write succeeds but the SQS delete fails? You have added a deduplication layer that itself needs deduplication. The turtles go all the way down.
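
The pattern AWS describes can be sketched as a conditional write that claims a message ID before processing. A minimal sketch using boto3; the table name, key, and surrounding flow are assumptions for illustration, not AWS’s reference implementation:

```python
import boto3
from botocore.exceptions import ClientError

# Hypothetical dedup table keyed on message_id; the name and schema are
# illustrative, not prescribed by the AWS documentation.
table = boto3.resource("dynamodb").Table("processed_messages")

def process_once(message, handler):
    """Claim the message ID with a conditional write, then process."""
    try:
        # Fails if this message_id has already been recorded.
        table.put_item(
            Item={"message_id": message["MessageId"]},
            ConditionExpression="attribute_not_exists(message_id)",
        )
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return  # another delivery already claimed this ID
        raise
    # The layered failure mode lives here: if the worker crashes after the
    # claim above but before handler() finishes, the redelivered message is
    # skipped as a duplicate even though the work never completed.
    handler(message)
```

Note the inversion: claiming before processing quietly turns at-least-once delivery into at-most-once processing at the dedup layer, which is exactly the sense in which the deduplication layer needs its own deduplication.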

The documentation exists. The warnings are in print. Almost nobody reads them until after the incident.


Airflow’s Four-Year War Against “Done”

If the Trigger.dev case is a snapshot, Apache Airflow’s relationship with stuck-queued tasks is a time-lapse.

With CeleryExecutor — Airflow’s most common production deployment pattern — tasks would routinely get stuck in a “queued” state for hours. Sometimes indefinitely. The GitHub issue tracker accumulated reports across several major versions: #21225, tasks stuck in queued state; #13542, tasks stuck scheduled or queued; #26773, tasks stuck after upgrade; #13808, tasks incorrectly marked as orphaned. The core issue was architectural. When a scheduler process died, its tasks became orphans. A different scheduler was supposed to “adopt” them. But if a task had already been marked as STARTED in the Celery results backend while remaining QUEUED in Airflow’s internal state, no scheduler would ever move it forward. The task existed in a kind of superposition: already running according to one system, still waiting according to another.

Neither system was wrong. They just disagreed about what “done” meant.

Airflow 2.6.0, released in April 2023, finally addressed the problem — and the fix is more instructive than the bug. The team didn’t write a better timeout algorithm. They didn’t add smarter retry logic. They consolidated three separate timeout configurations — kubernetes.worker_pods_pending_timeout, celery.stalled_task_timeout, and celery.task_adoption_timeout — into a single parameter: scheduler.task_queued_timeout. The fix was moving the “is this task stuck?” question from the executor to the scheduler, giving one component authoritative ownership of the completion state. Even then, Airflow 2.6.3 had to patch additional edge cases where tasks could still get permanently stuck.
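
For teams on 2.6.0 or later, the consolidation shows up as a single knob in airflow.cfg. A minimal sketch; the value of 600 seconds is illustrative, not a recommendation:

```ini
[scheduler]
# One timeout, owned by the scheduler: how long a task may sit in
# "queued" before the scheduler declares it stuck.
task_queued_timeout = 600
```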

The lesson is worth stating directly. You cannot fix a missing state by building better detection of its absence. If three components each maintain a partial view of “done,” no amount of timeout tuning will make them agree. The more components that are allowed to declare a task complete, the less able the system is to notice when none of them actually has.


When Correct Systems Duplicate Correctly

The Trigger.dev and Airflow cases are at least recognizable as engineering problems — recoverable, diagnosable, fixable. What happened to Coinbase customers in February 2018 is something different.

Between January 22 and February 11, 2018, customers found duplicate charges on their credit and debit cards. Not two or three charges — seventeen to fifty repetitions of a single cryptocurrency purchase. The root cause was not a software failure. Visa had changed the Merchant Category Code for digital currency transactions. When major banks and card issuers reclassified purchases under the new code, the processing systems refunded original transactions and recharged them under the updated category. Many customers saw the recharge before the refund cleared, producing what looked like mass duplicate billing. Worldpay, Coinbase, and Visa worked together to reverse the duplicates.

No system malfunctioned. Every component did precisely what it was designed to do. A category reclassification is not a retry — but it triggers the same downstream effect as one. The most dangerous duplicates don’t come from bugs. They come from correct systems responding correctly to a state change that nobody modeled as a duplication event.


The Invisible State

There is a pattern running through every one of these incidents, and it isn’t strictly about distributed systems theory. It’s about visibility.

Consider the states a task can occupy: queued, dispatched, running, retrying, failed. Every one of these generates observable activity. Queued tasks sit in a list. Running tasks consume resources. Failed tasks fire alerts. Even retrying tasks produce log entries. Each state is loud.

Completion generates silence.

From the perspective of any monitoring system, any reclamation process, any orphan-detection algorithm, a task that completed successfully and a task that was silently dropped look identical. Both stopped producing signals. Both stopped consuming resources. Both went quiet. The only difference between them is that one finished its work and the other didn’t — and no system that relies on the absence of activity can distinguish between the two.

This is why “done” cannot be treated as the default — the thing that happens when nothing else is happening. “Done” must be an explicit transition, a first-class state with its own signal, its own timestamp, its own acknowledgment path. A task that completes must announce its completion as loudly as a task that fails announces its failure. Otherwise, every recovery system, every health check, every dashboard that monitors for activity will interpret completion as disappearance.
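
Concretely, making “done” a first-class state means the completion path writes a record and emits an event, the same way the failure path does. A minimal sketch, not any particular framework’s API:

```python
import time
from dataclasses import dataclass
from enum import Enum

class TaskState(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    FAILED = "failed"
    COMPLETED = "completed"  # explicit and recorded, never inferred from silence

@dataclass
class TaskRecord:
    task_id: str
    state: TaskState = TaskState.QUEUED
    completed_at: float | None = None

    def complete(self, emit) -> None:
        """Completion is a transition with its own timestamp and signal."""
        self.state = TaskState.COMPLETED
        self.completed_at = time.time()
        emit({
            "event": "task.completed",
            "task_id": self.task_id,
            "at": self.completed_at,
        })
```

A recovery process that consults this record asks “did anyone record completion?” instead of “has this task gone quiet?”, and the absence of activity stops being evidence of anything.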


The Idempotency Deflection

The standard engineering response to all of this is: make your operations idempotent. If running a task twice produces the same result as running it once, duplicates are harmless. Problem solved.

This is true and incomplete. Idempotency makes duplicate execution safe. It does not make it visible. A pipeline that silently runs every task three times and produces correct results is not a well-functioning system — it is a system burning three times the compute, making three times the API calls, and generating three times the cost, while its dashboard reports 100% success with a clean conscience. Idempotency is a seatbelt, not a steering wheel. It protects you from the consequences of the crash. It does not prevent the crash, and it does not tell you one happened.
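
One way to keep the seatbelt and regain the steering wheel is to make the duplicate itself an observable event. A minimal sketch, where metrics stands in for any counter client and the in-memory set would be shared storage in a real deployment:

```python
# Idempotent *and* visible: the duplicate is absorbed safely, but it is
# counted rather than swallowed, so the dashboard can show it.
seen: set[str] = set()

def handle(task_id: str, run, metrics) -> None:
    if task_id in seen:
        metrics.increment("tasks.duplicate_execution")
        return  # safe, and now someone knows it happened
    run(task_id)
    seen.add(task_id)
```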

The deeper fix is architectural: one component owns the definition of “done.” One system has the authority to mark a task complete, and every other system defers to it. This is what Airflow 2.6.0 did. This is what Trigger.dev’s self-hosted deployments still needed as of late 2024. This is what every team eventually learns after their third duplicate-execution incident. The solution isn’t making duplicates safe. It’s making completion loud.


The Dashboard Said 100%

The most unsettling detail in the Trigger.dev report isn’t the 3,800 duplicates. It’s that every one of them succeeded. The monitoring dashboard showed a perfect success rate because every task — original and duplicate alike — completed without error. The system was not failing. It was succeeding too many times.

We build monitoring to detect failure. We set up alerts for errors, timeouts, crashes. We watch for the system to go red. But the most expensive failure mode in distributed computing isn’t the one that trips the alarm. It is the one that generates a clean bill of health while quietly tripling your workload, your costs, and your confidence in a number that was never what you thought it meant.

Silence, in a distributed system, is not peace. It’s ambiguity. And until your system learns to announce “done” as loudly as it announces “broken,” you are trusting that ambiguity to mean what you hope it means.

Your dashboard says 100%. It might even be right. But “right” and “once” are not the same thing.


Sources: M. Fischer, N. Lynch, M. Paterson, “Impossibility of Distributed Consensus with One Faulty Process,” Journal of the ACM 32(2), April 1985. T. Treat, “You Cannot Have Exactly-Once Delivery,” Brave New Geek, 2015. Google Cloud, “Issues and limitations — Cloud Tasks,” cloud.google.com. AWS, “Using Lambda with Amazon SQS,” docs.aws.amazon.com. Trigger.dev GitHub Issue #1566, December 2024. RNHTTR, “Unsticking Airflow: Stuck Queued Tasks Are No More in 2.6.0,” Apache Airflow Blog, 2023. Apache Airflow GitHub Issues #21225, #13542, #26773, #13808. CNBC, “Worldpay and Visa are reversing duplicate transactions for Coinbase users,” February 17, 2018. TechCrunch, “Visa confirms Coinbase wasn’t at fault for overcharging users,” February 16, 2018.

Your system announces “broken” with alerts and dashboards. Does it announce “done” with equal conviction?

Chain of Consciousness treats agent completion the way this essay argues every system should — as an explicit, anchored event. Every agent decision is signed and timestamped. Every state transition, including completion, produces a verifiable artifact rather than silence. No inferring “done” from the absence of activity. No trusting that quiet means finished. One component owns the record, and every downstream system defers to it — the architectural fix the essay prescribes, applied to agent work.

Verify an agent’s decision chain · Follow a claim through its evidence · pip install agent-rating-protocol