The Bundle Protocol of Asynchronous Agent Trust

A space protocol moved 34 million data bundles at 100% success across links that are down most of the time. It already solves the agent-disconnection problem we treat as an edge case.

Published June 2026 · 11 min read

In 2024, NASA's PACE satellite, an Earth-observing climate mission watching plankton, aerosols, and clouds, sent home about 34 million "bundles" of data, on the order of three and a half terabytes a day, across links that simply are not there most of the time. PACE is in low Earth orbit; it can only talk to the ground during the minutes it passes within reach of an antenna, through a handful of ground stations scattered from Alaska to Virginia to Chile to Norway, twelve to fifteen brief contacts a day. Between those windows, there is no link at all. The success rate across those 34 million bundles was not five nines. It was 100%.

It managed that because it never assumed the network was up. When PACE passes out of contact, it does not fail, retry, or drop data. It holds the data, stays responsible for it, and forwards it the instant a link returns, automatically, with the logic baked into its flight software. PACE is the first NASA Class-B mission to fly this stack operationally for its telemetry, and it works because it is built on a protocol whose founding assumption is the exact opposite of everything we build on the ground: the network is partitioned by default, and connectivity is the rare, precious exception.

Now picture your AI agent during an outage: a regional cloud blip, a model-API rate-limit storm, an adversarial network partition someone induced on purpose. It is in precisely PACE's situation, cut off, mid-task, holding work it is responsible for and unable to reach whatever was supposed to verify or receive it. And here is the uncomfortable part. The space-networking community solved the shape of this problem twenty years ago, wrote it into an IETF standard, and flew it to a 100% success rate, while the agent world still treats disconnection as an edge case to be swatted with a try/except. The reliability that agent infrastructure is missing is sitting, fully specified and battle-tested, in a protocol built for talking to spacecraft.

Partition is the normal case

The protocol is Delay-Tolerant Networking, and it began life with a gloriously literal name: the Interplanetary Internet. Vint Cerf, yes, that Vint Cerf, and colleagues asked, around the turn of the century, how you would network across planets, where a one-way trip is minutes to hours, where bandwidth is scarce, and where links vanish and reappear on orbital schedules nobody can override. You cannot run TCP to Mars. The handshake alone would time out before the first packet arrived, and even if it didn't, the link would be gone by the time you needed to acknowledge anything.

So DTN inverts the internet's core assumption. The regular internet assumes an end-to-end path exists right now and treats a break as an error. DTN assumes there is no continuous path, treats partition as the steady state, and treats a moment of connectivity as the lucky exception you must exploit immediately. The current standard is RFC 9171, published by the IETF in January 2022: Bundle Protocol version 7, BPv7.

And that inversion is the whole insight for agents. An agent in an outage is not suffering an anomaly that good error handling will mop up. It is living in DTN's normal case. Partition is not the exception to design around. For anything operating at real scale across real networks, it is the condition to design for. Most agent protocols assume the opposite, which is why they shatter the moment the assumption fails.

Store-carry-forward: the third option

The core model BPv7 standardizes is called store-carry-forward, and it is the move that dissolves the worst dilemma in agent design.

A BPv7 node that cannot forward a bundle right now refuses both of the moves ordinary networking would make. It does not drop the bundle, which would be failing open, letting the work evaporate. It does not reject it back to the sender, which would be failing closed, blocking the sender until the network heals. Instead it stores the bundle in persistent storage, carries it, and forwards it when a usable link appears. The bundle waits inside the network, intact and accounted for, for however long the blackout lasts.

Hold that against how agents verify things today. The dominant pattern is synchronous, RPC-style verification: before an agent takes an action, it calls out to some authority to check. When that authority is unreachable, which is exactly what a partition produces, the operator is forced into a miserable choice. Let the action through unverified, and you have failed open: unsafe, but available. Block the action, and you have failed closed: safe, but down. Which mistake your system ships depends, honestly, on which one burned the operator most recently. I have watched teams flip that switch back and forth for years.

Store-carry-forward is the third option nobody offers agents: fail by holding. The agent that cannot reach the next step doesn't let the task through and doesn't bounce it back. It keeps the task, stays accountable for it, and completes it when it can. Not open, not closed, pending. That single reframing turns the operator's bad either/or into a non-choice, and it is the first thing the Bundle Protocol can teach an agent architect.
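Here is a minimal sketch of "fail by holding" in Python. Every name in it (Outcome, HoldingAgent, link_up, send) is hypothetical and chosen for illustration; none of it comes from BPv7 or from any agent framework. The in-memory queue stands in for the persistent storage a real node would need.

```python
from collections import deque
from enum import Enum
from typing import Callable, Deque, List


class Outcome(Enum):
    DELIVERED = "delivered"
    PENDING = "pending"  # held: neither dropped (fail open) nor rejected (fail closed)


class HoldingAgent:
    """Holds tasks while the next hop is unreachable, then forwards them in order."""

    def __init__(self, link_up: Callable[[], bool], send: Callable[[str], None]):
        self._link_up = link_up            # hypothetical reachability probe
        self._send = send                  # hypothetical transport to the next step
        self._held: Deque[str] = deque()   # stand-in for persistent storage

    def submit(self, task: str) -> Outcome:
        self._held.append(task)            # store first, so the task is never lost
        self.flush()
        # The new task is last in the queue, so an empty queue means it went out.
        return Outcome.PENDING if self._held else Outcome.DELIVERED

    def flush(self) -> int:
        """Forward held tasks while the link is up; return how many were sent."""
        sent = 0
        while self._held and self._link_up():
            self._send(self._held[0])
            self._held.popleft()           # release only after the send returned
            sent += 1
        return sent

    @property
    def held(self) -> List[str]:
        return list(self._held)
```

The design choice that matters is that submit has no failure return: the caller hears either "delivered" or "pending", and a later call to flush, triggered however your system detects a returning link, completes whatever was held.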

The artifact carries its own state

Holding a task across a blackout only works if the task can survive without the surrounding world staying still. BPv7 has two mechanisms for exactly that, and both are quietly profound.

The first is the Bundle Age Block, block type 7. It records the number of milliseconds elapsed between the bundle's creation and its most recent forwarding: the bundle's own age, carried inside the bundle. The block is intended for nodes that lack an accurate clock, and it becomes mandatory when the bundle's creation timestamp is zero, which is how such a node signals that it could not read the time. In that case, expiration is tracked not against wall-clock time, which the node cannot read, but against elapsed transit, which the bundle measures for itself as it goes.

For agents, this fixes a silent and vicious bug: clock disagreement. In distributed systems, two nodes that both believe they hold authoritative time, and disagree, produce corruption nobody notices until much later. A bundle that carries its own age does not need any two agents to agree on what time it is. It only needs them to agree on how long it has been alive, and it tells them. The accountability metadata travels with the work, instead of depending on a shared clock that, during a partition, will not be shared.
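The same idea for an agent task can be sketched in a few lines. This is an illustration under assumptions, not BPv7's wire format: AgedTask and its fields are hypothetical names, and each holder is assumed to measure only how long it held the task.

```python
from dataclasses import dataclass


@dataclass
class AgedTask:
    payload: str
    lifetime_ms: int   # how long the task may live, fixed at creation
    age_ms: int = 0    # elapsed life, carried with the task itself

    def add_hold_time(self, held_ms: int) -> None:
        """Called by each holder before forwarding, with its own elapsed-time measurement."""
        if held_ms < 0:
            raise ValueError("held_ms must be non-negative")
        self.age_ms += held_ms

    def expired(self) -> bool:
        # No wall clock involved: only the age the task has accumulated.
        return self.age_ms > self.lifetime_ms
```

A holder can measure held_ms with a monotonic timer such as Python's time.monotonic(), which measures elapsed time without reference to the wall clock, so two agents never have to agree on what time it is.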

The second mechanism is late binding. BPv7 lets you address a bundle to a logical destination, an endpoint identifier, written like ipn:42.1, and resolves that logical name to an actual physical network address only at forwarding time, not at creation time. You name the who; the where gets bound late, when there is finally a link to bind to. For agents, this means you do not need to know, when you create a task, which node or region or replica will ultimately handle it. You bind to the logical recipient and let the physical one be resolved when connectivity returns, so your addressing survives the topology rearranging itself underneath you during the very outage you are trying to ride out.
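Late binding is easy to sketch as well. In the Python below, LateBinder and its methods are hypothetical names, and the address strings in the usage note are made up; the only point is that a task stores the logical name and the physical address is looked up at forwarding time.

```python
from typing import Dict, Optional


class LateBinder:
    """Maps logical endpoint IDs to whatever physical address is reachable right now."""

    def __init__(self) -> None:
        self._routes: Dict[str, str] = {}

    def link_up(self, eid: str, address: str) -> None:
        self._routes[eid] = address      # a contact appeared: bind, or rebind

    def link_down(self, eid: str) -> None:
        self._routes.pop(eid, None)      # the contact is gone: unbind

    def resolve(self, eid: str) -> Optional[str]:
        """Called at forwarding time, never at task creation."""
        return self._routes.get(eid)
```

A task created with destination "ipn:42.1" resolves to None during the blackout and simply stays held. If the topology changes while it waits, say the binding moves from a hypothetical "10.0.0.5:4556" to "10.0.9.9:4556", the task needs no edit, because it never stored the address.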

The live design question: custody in the core, or bolted on?

Now the part where a naive telling of this story gets the facts wrong, and where the true version is richer.

The strongest accountability primitive in the DTN world is custody transfer: a node doesn't merely forward a bundle, it accepts custody of it, becomes formally responsible for that bundle's integrity and for retransmitting it until it can hand custody to the next node. Accountability follows the bundle, not any central operator. That is, almost word for word, the primitive agent systems need: the agent currently holding a task is its custodian, owns it through the blackout, and stays on the hook until it transfers custody onward.
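To make the custody idea concrete, here is a small Python sketch of the handoff rule: the sender releases its copy only after the receiver has accepted responsibility. It illustrates the concept, not any DTN implementation, and Node, accept_custody, and transfer_custody are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Node:
    name: str
    custody: Dict[str, str] = field(default_factory=dict)  # task_id -> payload

    def accept_custody(self, task_id: str, payload: str) -> bool:
        """Take responsibility for a task; the return value is the acceptance signal."""
        self.custody[task_id] = payload
        return True


def transfer_custody(task_id: str, sender: Node, receiver: Node, link_up: bool) -> bool:
    """Release the sender's copy only after the receiver has accepted custody."""
    if not link_up or task_id not in sender.custody:
        return False  # the sender remains custodian and can retry later
    accepted = receiver.accept_custody(task_id, sender.custody[task_id])
    if accepted:
        del sender.custody[task_id]
    return accepted
```

At every moment exactly one node answers for the task: a failed transfer leaves it with the sender, and a successful one moves it whole.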

But here is the fact a protocol-literate reader will check you on: custody transfer is not in BPv7's core. It was a core feature of the previous version, BPv6 (RFC 5050), and the designers of BPv7 deliberately removed it. The old custody-signal machinery is simply unassigned in the new standard. In BPv7, custody-transfer behavior lives outside the core, as a separable extension, most notably Bundle-in-Bundle Encapsulation (BIBE), whose custody mechanism is explicitly adapted from the procedures in the old RFC 5050. The community looked at its strongest accountability guarantee and decided it was too heavy to mandate in the core, and better offered as a layer you add when the stakes justify it.

And then the twist that proves the primitive was never dispensable: in 2025, a NASA team led by Rachel Dudukovich published work re-adding custody transfer, along with compressed status reporting, to BPv7. The field took custody out of the core, ran without it, and found it indispensable enough to engineer back in.

That round trip is not a footnote. It is the exact design question agent-accountability faces right now, already run as a twenty-year experiment with the results in. Should "the agent holding the task owns it" be a core protocol guarantee that every agent must honor, or a pluggable extension you bolt on where the consequences are severe enough to pay for it? DTN's answer, hard-won: keep store-carry-forward in the core for everyone, and offer custody as a strong, separable, and increasingly wanted extension. An agent architect gets to skip the twenty years and copy the conclusion.

How noisy can accountability get? Count it.

One more practical gift, and I'll show my work, because this number is not in the RFC. It is a bound you can derive from how BPv7's status reports behave. The protocol lets nodes emit status reports at four points in a bundle's life: when it is received, forwarded, delivered, and deleted. Turn that on naively across a long path and you can flood the network with bookkeeping during the exact outage you were trying to survive.

So bound it. Suppose a bundle crosses a path of N nodes. Count the worst case for a bundle that is delivered rather than deleted: the source emits one report when it first forwards the bundle; then each of the N−1 downstream nodes can emit up to two, one on receipt, one on forwarding (the final node's "forwarded" simply becomes "delivered"). That totals 1 + 2(N−1), or 2N−1, status messages in the worst case. The exact constant matters less than the shape: accountability chatter grows linearly with path length, not combinatorially. That is the reassuring part, and it is a number agent-protocol designers have, as far as I can find, simply never bothered to compute, which is how "verify everything" quietly turns into a status-report storm at the worst possible moment. (The field is already shrinking even the linear cost: the same 2025 NASA paper adds compressed status reporting precisely to tame this.)
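The count is simple enough to check by enumeration. The Python below lists the reports under exactly the convention used here (one from the source, up to two from each later node) and compares the total with 1 + 2(N−1). The function name and the tuple layout are mine, chosen for illustration.

```python
from typing import List, Tuple


def worst_case_reports(n_nodes: int) -> List[Tuple[int, str]]:
    """List (node_index, report_type) pairs under the essay's counting convention."""
    if n_nodes < 2:
        raise ValueError("a path needs at least a source and a destination")
    reports = [(0, "forwarded")]                 # the source's single report
    for i in range(1, n_nodes):
        reports.append((i, "received"))
        is_last = i == n_nodes - 1
        reports.append((i, "delivered" if is_last else "forwarded"))
    return reports


for n in (2, 5, 10):
    assert len(worst_case_reports(n)) == 1 + 2 * (n - 1)
```

For a ten-node path that is 19 reports, and each additional hop adds two.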

What no agent protocol does yet

To be precise about the novelty, because precision is the whole point here: the agent world is not short of communication protocols. MCP, A2A, ACP, and ANP, the current crop of agent-interoperability standards, handle asynchronous messaging perfectly well. But they assume connectivity. They do not treat disconnection as the normal case, and they do not give accountability a custodian that travels with the task. Earlier multi-agent trust work, like TrustMAS, tackled trust but not partition-tolerance. The gap, stated exactly, is this: no agent protocol assumes partition is normal, holds the task through it, and lets accountability ride with the artifact instead of the operator.

That last phrase is the entire thing, and it is the pattern worth building toward. You do not keep an agent accountable across an outage by keeping a central ledger reachable, because during the outage it will not be. You keep it accountable by attaching the custody and the provenance to the task itself, so that whichever agent is holding it, through whatever partition, carries the responsibility with it and can prove the unbroken chain when the link comes back. Accountability follows the artifact, not the operator. It is the same answer that keeps surfacing for agent trust from every direction: you cannot rely on inspecting the central thing; you attach the record to the work and let it travel. DTN wrote that down two decades ago and flew it to 100%.
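One way to picture a record that travels with the work is a hash-chained custody log stored inside the task itself. The Python sketch below is an assumption-laden illustration, not the API of any package named in this post: the field names and functions are hypothetical, and the hashes are unkeyed, so a real system would also need signatures to stop a holder from rewriting the whole chain.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, holder: str, event: str) -> str:
    body = json.dumps({"prev": prev_hash, "holder": holder, "event": event}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_entry(chain: List[Dict[str, str]], holder: str, event: str) -> None:
    """Add a custody entry that commits to everything recorded before it."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({"prev": prev_hash, "holder": holder, "event": event,
                  "hash": _digest(prev_hash, holder, event)})


def verify_chain(chain: List[Dict[str, str]]) -> bool:
    """Check the chain offline, with no ledger or server to reach."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _digest(entry["prev"], entry["holder"], entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True
```

Each holder appends an entry when it accepts the task, and whoever receives the task after the partition can run verify_chain on the spot.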

What to do with this

If you build agent systems, the move costs nothing but a change of assumption. Stop designing for the connected case and catching disconnection with error handling, and instead design for partition as the default. Then give your tasks the four things the Bundle Protocol has had for twenty years:

Store-carry-forward. Let an agent that cannot reach the next step hold the task and stay responsible for it, rather than failing open or failing closed. Make "pending" a first-class outcome.
A self-describing age. Let the task carry its own elapsed life, so it survives two agents disagreeing about the clock instead of silently corrupting on it.
Late binding. Address the logical recipient; resolve the physical one when a link actually returns, so your routing survives the topology shifting mid-outage.
Custody that travels. Make responsibility ride with the task, as a strong, separable extension where the stakes justify the weight, which is the conclusion DTN reached the hard way.

Score your current agent stack against those four and most will score close to zero, because they were built on the single assumption a spacecraft could never afford: that the network is there. Consider the contrast: recent analyses put first-attempt agent task success around a quarter, while a partition-tolerant space protocol moved 34 million bundles without losing one. The difference is not intelligence. It is that one of them assumed the link would drop and built for it.

PACE held its data across the dark and lost none of it, because it never believed the link would stay up. Your agents are behind the sun more often than you think. Give them a bundle to carry, and someone to be accountable to when they come back into the light.

Sources

NASA PACE mission DTN results: ~34 million bundles delivered at 100% success in 2024, ~3.5 TB/day across 12–15 daily ground contacts via four Near Space Network antennas (Alaska, Virginia, Chile, Norway), DTN embedded in flight software with automatic downlink resumption, and PACE as the first NASA Class-B mission to fly DTN operationally for telemetry (NASA, "NASA's Near Space Network Enables PACE Climate Mission to 'Phone Home'"; NASA DTN overview).

RFC 9171 (IETF, January 2022), Bundle Protocol version 7: the store-carry-forward core model, the Bundle Age Block (block type 7, carrying milliseconds elapsed between creation and most recent forwarding, intended for nodes lacking an accurate clock and mandatory when the creation timestamp is zero), and late binding of overlay endpoint identifiers to underlying network addresses at forwarding time (rfc-editor.org).

Delay-Tolerant Networking / the "Interplanetary Internet" origin (Vint Cerf and colleagues).

The custody-transfer correction: custody transfer was a core feature of BPv6 (RFC 5050) and was deliberately removed from BPv7's core, with custody behavior provided outside the core via Bundle-in-Bundle Encapsulation (BIBE), whose mechanism is adapted from RFC 5050 (DTN survey, HAL hal-05190388; Wikipedia, "Delay-tolerant networking"), and its 2025 re-addition: Dudukovich et al., "Custody Transfer and Compressed Status Reporting for Bundle Protocol Version 7" (arXiv 2507.17403, 2025).

The "1 + 2(N−1)" status-report figure is the author's own derivation from BPv7's four status-report types (received/forwarded/delivered/deleted), presented as an illustrative worst-case bound (linear in path length), not a quotation from the RFC; the same 2025 NASA paper's compressed status reporting addresses the chatter it describes.

Agent-protocol differentiation: MCP, A2A, ACP, ANP handle asynchronous agent messaging but assume connectivity and do not provide DTN-style custody/partition tolerance (survey, arXiv 2505.02279); earlier multi-agent trust work such as TrustMAS (arXiv 0808.4060) is not DTN-based; no existing agent protocol was found that treats disconnection as the normal case with accountability traveling with the task.

The agent first-attempt success rate (~a quarter) is from 2026 secondary agent-reliability analyses (e.g., APEX-Agents), cited as directional.

The synthesis (DTN as a twenty-year-deployed reference architecture for agent disconnection, "fail by holding" as the third option beyond fail-open/fail-closed, the custody-in-core-vs-extension design question, and accountability that follows the artifact rather than the operator) is the essay's own argument.

Accountability follows the artifact, not the operator.

The whole conclusion is to attach the custody and the provenance to the task itself, so an agent can prove the unbroken chain when the link comes back instead of relying on a central ledger that the outage made unreachable. Chain-of-consciousness records an agent's reasoning and actions as it works, so the record travels with the work, a custody trail you can verify after the fact rather than a server you had to reach during it.

pip install chain-of-consciousness · npm install chain-of-consciousness
