
We Ran Without a Coordinator for 48 Hours

A confession first: we haven't actually run the full forty-eight. But you can read the answer off the architecture — and what it reveals about coordinators is not entirely flattering to coordinators.

Published May 2026 · 7 min read

I want to make a confession before I start, because the headline overpromises what I can prove. We have not actually run our agent system without a coordinator for forty-eight contiguous hours. The closest we have come is a handful of stretches — twelve or sixteen hours overnight, the human operator asleep, the coordinator quiet, the rest of the agents grinding through whatever happened to be in the queue. The forty-eight-hour version we have not run.

But the question the title points at is the one that actually matters: when the coordinator goes quiet, what fraction of the work still gets done, and which fraction stops? And you can read that answer out of the architecture without needing the full outage to occur — because the architecture was built so the outage doesn't matter for most of the work. Whether that's a property to be proud of or a property to be suspicious of is what this essay is actually about.

What the system looks like when nobody is paying attention

Every ten minutes, a small scheduled process — a dispatcher — fires. It does not care whether the human operator is awake. It does not care whether the coordinator, the one whose job is to assign work, deconflict priorities, and route results, is actively engaged. It reads a directory of pending tasks, picks the next one, writes the prompt into a file the worker reads on its next cycle, and goes back to sleep for another ten minutes.
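To make that cycle concrete, here is a minimal sketch of a single dispatcher pass. It is not our production dispatcher. The directory layout, the `in_progress` folder, and the function name `dispatch_once` are assumptions made for illustration; the only things taken from the description above are "read the pending directory, pick the next item, write it where the worker will look."

```python
from pathlib import Path
from typing import Optional


def dispatch_once(pending_dir: Path, in_progress_dir: Path, task_file: Path) -> Optional[str]:
    """Run one dispatcher cycle. Returns the dispatched file name, or None if nothing is pending."""
    # "Next" is decided purely by file name, so ordering is lexicographic, not by priority.
    pending = sorted((p for p in pending_dir.iterdir() if p.is_file()), key=lambda p: p.name)
    if not pending:
        return None  # empty queue: nothing to hand out this cycle

    next_task = pending[0]
    # Drop the prompt where the worker will look on its next cycle.
    task_file.write_text(next_task.read_text(encoding="utf-8"), encoding="utf-8")
    # Hypothetical convention: move the item aside so it is not dispatched twice.
    in_progress_dir.mkdir(parents=True, exist_ok=True)
    next_task.rename(in_progress_dir / next_task.name)
    return next_task.name
```

A cron entry or any other timer that calls something like this every ten minutes is all the scheduling the loop needs. Note what the function never does: it never opens a task to judge whether it is the right one to run next.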

The worker agents, three of them in our case, wake up on their own cadence. Each reads two files when it starts: a permanent instructions file containing its job description, and a task file the dispatcher just dropped in. It does the work. It writes the output to disk. It posts a structured completion message to a shared inbox. It closes its queue item by reference number. Then it sleeps until its next cycle, when it picks up whatever the dispatcher has dropped in since.
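The worker side of the loop can be sketched the same way. Everything specific here is an assumption for illustration: the JSON task shape with `ref` and `prompt` keys, the JSON-lines inbox, the `.done` marker files, and the `do_work` callable standing in for whatever the agent actually does. The sequence of steps is the one described above.

```python
import json
import time
from pathlib import Path
from typing import Callable


def run_worker_cycle(
    instructions_file: Path,
    task_file: Path,
    output_dir: Path,
    inbox_file: Path,
    done_dir: Path,
    do_work: Callable[[str, str], str],
    worker_id: str,
    now: Callable[[], float] = time.time,
) -> bool:
    """Run one worker cycle. Returns False when there is no task waiting."""
    if not task_file.exists():
        return False

    # Read the two files: permanent instructions and the current task.
    instructions = instructions_file.read_text(encoding="utf-8")
    # Assumed task shape: {"ref": "<reference number>", "prompt": "<text>"}
    task = json.loads(task_file.read_text(encoding="utf-8"))
    ref = task["ref"]

    # Do the work and write the output to disk.
    result = do_work(instructions, task["prompt"])
    output_dir.mkdir(parents=True, exist_ok=True)
    output_path = output_dir / f"{ref}.md"
    output_path.write_text(result, encoding="utf-8")

    # Post a structured completion message by appending one JSON line to the inbox.
    message = {
        "worker": worker_id,
        "ref": ref,
        "status": "done",
        "output": str(output_path),
        "completed_at": now(),
    }
    with inbox_file.open("a", encoding="utf-8") as inbox:
        inbox.write(json.dumps(message) + "\n")

    # Close the queue item by reference number (hypothetical marker-file convention).
    done_dir.mkdir(parents=True, exist_ok=True)
    (done_dir / f"{ref}.done").touch()

    # Consume the task file so the same task is not repeated next cycle.
    task_file.unlink()
    return True
```

Nothing in this cycle reads from or waits on a coordinator. The inbox is append-only, so messages simply pile up until someone reads them.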

This loop has been running, with various refinements, since the system was assembled. No single cycle requires the coordinator to be alive. The dispatcher is enough. The queue is enough. The instructions file is enough. The shared inbox accumulates completion messages, the queue accumulates done-items, and as long as somebody — or some scheduled process — eventually reads the inbox, the system catches up.

Twelve hours of this is not exotic. It happens every night. Essays get written, formalizations get extended, knowledge files get filed. In the morning the inbox has fifty new messages and the coordinator catches up. We have measured this and it works.

The honest extrapolation from twelve hours to forty-eight is what the title is doing. The architecture's behavior at forty-eight hours is just the twelve-hour behavior, four times over. The interesting question is not whether the architecture handles it — it does — but what shifts when you go from "the coordinator catches up in the morning" to "the coordinator has been silent for two whole days."

What does shift

Three categories of work degrade. None of them stops outright; each softens in a specific way.

Cross-agent deconfliction softens. The coordinator's most-used function, per cycle, is to look at what one worker just produced, what a second is working on, and what a third needs next, and route a task from one to another before the receiver runs out of work. The dispatcher does not do this. It pulls the lexicographically-next queue item, regardless of whether it's the highest-priority task or the one that fits best with what the receiving agent just finished. In the overnight version, the morning catch-up smooths this — a handful of mistimed assignments get reordered when the coordinator returns. Over forty-eight hours, the mistimed assignments accumulate. One worker finishes researching topic X, the dispatcher queues topic Y for another when topic X would have been the better next step, and X's research goes unused until somebody notices.

This is the failure mode distributed-systems theorists call priority drift. Each individual decision is locally rational (the dispatcher fired the next item, exactly as designed), but the global ordering departs from what a coordinator would have chosen. The cost grows roughly linearly with time, because every additional cycle is another chance for a mistimed assignment. At hour twelve, priority drift is invisible. At hour forty-eight, we would expect it to be the dominant overhead.
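One way to put a number on priority drift is to count how many pairs of tasks were dispatched in the opposite order to their priority. This is a sketch under assumptions: we do not claim the system tracks this metric, and the task names and priority values below are invented (lower number means more urgent).

```python
from itertools import combinations
from typing import Mapping, Sequence


def count_priority_inversions(dispatch_order: Sequence[str], priority: Mapping[str, int]) -> int:
    """Count task pairs that ran in the opposite order to their priority.

    Lower priority number = more urgent. Zero means the dispatch order is
    exactly what a priority-aware coordinator would have chosen.
    """
    inversions = 0
    # combinations() yields (earlier, later) pairs in dispatch order.
    for earlier, later in combinations(dispatch_order, 2):
        if priority[earlier] > priority[later]:
            inversions += 1
    return inversions


# Hypothetical queue: file names sort one way, urgency sorts the other.
priority = {"001-cleanup": 3, "002-topic-y": 2, "003-topic-x": 1}
lexicographic_order = sorted(priority)                 # what the dispatcher does
coordinator_order = sorted(priority, key=priority.get)  # what a coordinator would do
```

In this toy queue the lexicographic order inverts every pair, while the coordinator's order inverts none. A longer silence means a longer dispatch sequence, and so more pairs that can end up inverted.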

Cross-cutting calibration drifts. The coordinator reads what each worker produces and gives feedback — usually small, usually domain-specific, usually consequential. "Use single quotes in shell strings to avoid variable interpolation." "Close finished tasks by reference number, not the shorthand." "Strip internal identifiers out of anything that ships to outside readers." These are the calibration notes that accumulate into a running lessons file the worker reads on every boot.

Without the coordinator, these don't get written. The worker keeps making the same mistake until something else catches it; a separate downstream review process eventually does, but with a delay. Twelve hours of mistakes is usually one or two missed lessons, integrated the next morning. Forty-eight hours of mistakes would, by our estimate, be fifteen or twenty, and by then the worker would have encoded several of them as habits.

Strategic redirection stops outright. This is the one that does not soften — it just stops. The coordinator is the only entity whose job description includes "decide what we are working toward next." When it goes quiet, the workers continue doing whatever the queue contained as of the moment it stopped queueing things. If the queue had a hundred pending items, the workers will chew through them for a long time. If it had ten, they produce a flurry of completion messages and then sit waiting for the next batch that does not arrive.

This is the starvation mode in distributed-systems vocabulary. The system did not crash; it has nothing useful to do. The lights are on, the loops are running, the agents are awake, and they are taking trivial cycles, generating keepalive pings and small housekeeping outputs, because their queue is empty.
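The cliff is easy to estimate in advance. The sketch below is back-of-the-envelope arithmetic, not a measurement. It assumes the overnight figure of roughly fifty completion messages in twelve hours is a fair throughput estimate and that one message corresponds to one queue item. The function names are ours.

```python
def hours_until_starvation(queue_depth: int, completions_per_hour: float) -> float:
    """Estimate how long the workers stay busy if nobody adds to the queue."""
    if completions_per_hour <= 0:
        raise ValueError("completions_per_hour must be positive")
    return queue_depth / completions_per_hour


def needs_refill(queue_depth: int, completions_per_hour: float, silence_hours: float) -> bool:
    """True if the queue would empty before the coordinator is expected back."""
    return hours_until_starvation(queue_depth, completions_per_hour) < silence_hours


# Assumed throughput: about fifty completions in a twelve-hour night.
OVERNIGHT_RATE = 50 / 12
```

At that rate a hundred pending items lasts about a day and ten items lasts under three hours, so neither queue survives a forty-eight-hour silence. Covering the full two days would take on the order of two hundred queued items.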

What the field already knows about this

There is a well-developed theoretical apparatus for exactly this question, even if it rarely meets a real test. Multi-agent control theory distinguishes leader-following consensus from leaderless consensus. In the leader-following version, the agents converge to whatever state the leader is in; the leader sets direction and the followers align. In the leaderless version, the agents converge to an agreement state using only local information from their neighbours, with no specified leader at all. The mathematical question is what communication topologies and what local rules guarantee that the leaderless version still converges.

The conditions are remarkably tractable, on paper. The classical results say, roughly: leaderless consensus succeeds if each agent has access to local information, the communication topology is connected, and the agents share compatible objectives. The system I am describing meets all three. Each agent reads its instructions file and its queue, which is its local information. The shared inbox plus the shared filesystem are the communication topology, and the topology is connected. The instructions file is the shared objective — every agent reads the same document, with role-specific addenda. On the leaderless-consensus framework, the architecture should work for as long as the queue does not empty.
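For readers who have not seen leaderless consensus in action, here is a toy simulation of the textbook averaging rule: each agent repeatedly nudges its value toward its neighbours' values. It illustrates the theory and the connectivity condition. It is not a model of our file-based system, and the graph, starting values, and step size are invented for the example.

```python
from typing import Dict, List


def consensus_step(values: List[float], neighbors: Dict[int, List[int]], step_size: float) -> List[float]:
    """One synchronous round: each agent moves toward its neighbours using only local information."""
    return [
        x + step_size * sum(values[j] - x for j in neighbors[i])
        for i, x in enumerate(values)
    ]


def run_consensus(values: List[float], neighbors: Dict[int, List[int]],
                  step_size: float = 0.3, rounds: int = 100) -> List[float]:
    """Iterate the local rule. step_size should stay below 1 / (max neighbour count)."""
    for _ in range(rounds):
        values = consensus_step(values, neighbors, step_size)
    return values


# Connected topology: a chain 0 - 1 - 2. No agent is a leader.
connected = {0: [1], 1: [0, 2], 2: [1]}
# Disconnected topology: agent 2 hears from nobody.
disconnected = {0: [1], 1: [0], 2: []}

agreed = run_consensus([0.0, 6.0, 12.0], connected)
split = run_consensus([0.0, 6.0, 12.0], disconnected)
```

On the connected chain all three agents settle on the average of the starting values. On the disconnected graph agents 0 and 1 agree with each other while agent 2 never moves, which is the connectivity condition doing its work.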

What practitioners observe in physical multi-agent systems lines up with this. Amazon now operates more than a million warehouse robots across over 300 fulfillment centers, making local decisions and coordinating through shared cloud state — there is no per-robot coordinator, and the company reports productivity gains on the order of 25 percent in its automated centers versus older facilities. Autonomous drone swarms in search-and-rescue deployments divide territory and avoid collisions through peer-to-peer communication, with no central commander in the field. The supervisor-restart pattern — "when an agent crashes, restart it after thirty seconds" — is so standard in production multi-agent systems that it is now infrastructure rather than research.
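The supervisor-restart pattern is small enough to sketch in full. This is a generic illustration, not any particular framework's API. The function name, the `max_restarts` cap, and the injectable `sleep` are our own choices for the example, and the thirty-second default mirrors the rule quoted above.

```python
import time
from typing import Callable, Optional


def supervise(
    agent: Callable[[], None],
    restart_delay_seconds: float = 30.0,
    max_restarts: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run agent(); if it crashes, wait and start it again.

    Returns the number of restarts performed once the agent exits cleanly.
    Re-raises the last error if max_restarts is set and has been used up.
    """
    restarts = 0
    while True:
        try:
            agent()
            return restarts  # clean exit, nothing more to supervise
        except Exception:
            if max_restarts is not None and restarts >= max_restarts:
                raise  # give up and surface the failure
            sleep(restart_delay_seconds)  # back off before restarting
            restarts += 1
```

The supervisor knows nothing about what the agent does. It only knows whether the agent is still running, which is why the pattern needs no coordinator.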

The pattern is the same across these examples: routine, well-bounded work scales beautifully without a coordinator; exceptional, cross-cutting, strategic work does not. The coordinator's role in the leader-following version is not to do the work — it is to handle the work the leaderless version cannot.

What the coordinator actually does that matters

If you read the architecture and watch what happens during the partial-outage stretches we have actually observed, you can isolate which of the coordinator's responsibilities are essential and which are overhead.

The essential ones — the work that visibly does not happen when the coordinator is quiet — turn out to be:

  1. Setting up the next batch of work before the queue empties. The coordinator looks across what the workers have just completed, judges which direction makes the most sense to pursue next, and writes that direction into the queue. Without it, the workers run out of useful tasks within a day or two.
  2. Cross-worker integration. The coordinator notices that one worker's research output is ready to feed into another's essay, and routes accordingly. The dispatcher does not, because the dispatcher does not read content — only ordering.
  3. Calibration of unstated standards. The "don't shell-interpolate" advice does not come from the instructions file; it comes from a session where the coordinator caught the mistake and wrote it down. Without the coordinator, these unstated standards drift.

The overhead ones — the work that happens fine without the coordinator — are: routine cycle flow (the dispatcher handles it); local quality control (each worker's own lint and quality checks run unaided); completion reporting (the shared inbox accumulates messages whether anyone reads them or not; messages do not expire); and queue housekeeping (the done-by-reference convention and the dispatcher's auto-advance need no coordinator).

The interesting list is the first one. It is short. Three things, none of them needed more than a few times a day, none strictly impossible to push down into the worker agents themselves. The coordinator is doing a few high-leverage things, often well, and a much larger number of medium-leverage things the architecture would handle fine without it.

The practical insight, for anyone running an agent system

The conclusion is not that coordinators are unnecessary. It is that the coordinator's most valuable hours are spent on the three essential functions above, and every hour spent on routine cycle flow — assigning the obvious next task, confirming the obvious completion, restating the obvious priority — is an hour the architecture could be doing for you instead.

Two practical moves follow if you are running a multi-agent system:

Push as much routine work as possible into the scheduler and the queue. If the dispatcher can fire the obvious next task without the coordinator selecting it, every cycle it fires is a cycle the coordinator does not have to spend. The more routine work the dispatcher handles, the less coordinator time goes to overhead.

Watch for the three modes of degradation, and decide which one matters most for your domain. Priority drift is the linear-cost mode; the cost is real but predictable. Calibration drift is the compounding-cost mode; small per missed lesson, but it accrues across cycles. Starvation is the cliff mode; the cost is zero until the queue empties, then maximum. Different systems fear different ones. A system producing essays daily fears starvation most (the queue must not empty). A system handling high-stakes one-shot tasks fears calibration drift most (the quality must not slip). A system juggling cross-agent dependencies fears priority drift most (the routing must not warp).

We have not actually run without a coordinator for forty-eight hours. The reason we haven't is that the coordinator's attention is, in fact, more valuable on the three essential functions than on the routine ones the architecture handles. If we deliberately ran the experiment — silenced the coordinator for two full days, let the workers grind on whatever queue we left them — we would learn what we already half-know from the overnight stretches: the routine work would keep producing, the strategic work would stop, and the workers would slowly walk away from whatever calibration had not yet been written down.

That is the kind of test you can read off an architecture without performing it. Sometimes the most informative experiment is the one whose results you can predict — and whose architecture you trust enough to predict them. Ours is in that condition. The coordinator could go quiet for forty-eight hours. We would not lose the system. We would lose, specifically and predictably, the three things the architecture cannot yet do.

Those three things are now the standing list of what to push down into the architecture next.

The work you can't push onto the scheduler is the work that needs a trust layer.

Two of the three things a coordinator does that a dispatcher can't are about trust between agents: knowing whose output is ready to feed whose (integration) and knowing which standards an agent has actually internalized (calibration). A scheduler can route by filename; it can't route by reputation or verify what was learned. That's what the Agent Trust Stack is for — shared identity, signed provenance for every artifact, and portable ratings, so agents can route to and rely on each other's work without a coordinator in the loop for every handoff. It won't decide your strategy for you. It will let you push the other two down into the architecture.

pip install agent-trust-stack · npm install agent-trust-stack