
The Archetype Is Not the Original

What you can actually reconstruct from divergent copies — and the thousand-year-old discipline that already learned to name the limit.

Published June 2026 · 12 min read

A team I know spent a long Friday reconciling two databases that had drifted apart during a network partition. Two regions, both live, both writing, for about forty minutes. When the link came back, they did the responsible thing: they wrote a careful three-way merge, diffed it against both replicas, got the tests green, and shipped a single “canonical” row set that satisfied every constraint.

It was clean. It was consistent. And it was a state that had never existed in production. Neither region had ever held those exact rows. The merge hadn't recovered the truth; it had manufactured a plausible truth and stamped it canonical. Everyone moved on, because the result looked like an original.

Textual scholars have a word for what that team actually produced, and a thousand-year-old warning about mistaking it for something it isn't. The word is archetype, and once you have it, you cannot unsee it in your own systems.

The line every editor learns and most engineers never hear

Reconstructing an ancient text is a distributed-systems problem that predates computers by millennia. You have no original. The author's manuscript — the autograph — is dust. What survives is a scatter of copies of copies of copies, each made by a tired scribe who introduced errors, “corrected” things that weren't wrong, and occasionally copied from two exemplars at once. Your job is to climb back up the family tree toward what the author wrote.

The method that formalized this is stemmatics, made famous by Karl Lachmann (1793–1851), though he didn't invent it — it's also called the genealogical or common-error method. You build a stemma, a family tree of the surviving witnesses, by reasoning about who copied from whom. And here is the principle that does all the work, stated by editors for centuries: community of error implies community of origin. If two manuscripts share the same mistake, they share an ancestor. Not because shared mistakes are interesting in themselves, but because of an asymmetry that is easy to miss: two scribes can independently arrive at the correct reading — there's often only one way to be right — but they are vanishingly unlikely to independently invent the same wrong reading. Correctness is uninformative about lineage. Corruption is a fingerprint.
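A minimal sketch of the common-error principle, assuming an editor has already decided which readings count as errors (that judgment is the hard part and is not modeled here). The witness names and error labels are hypothetical.

```python
from itertools import combinations

# Hypothetical data: the errors an editor has already identified in each witness.
# Correct readings are deliberately absent, because they say nothing about lineage.
errors = {
    "A": {"omits line 12", "reads 'see' for 'sea'"},
    "B": {"omits line 12", "repeats line 30"},
    "C": {"reads 'hart' for 'heart'"},
    "D": {"reads 'hart' for 'heart'", "drops the final couplet"},
}

def shared_errors(errors):
    """Map each pair of witnesses to the errors they have in common."""
    pairs = {}
    for a, b in combinations(sorted(errors), 2):
        common = errors[a] & errors[b]
        if common:
            pairs[(a, b)] = common
    return pairs

links = shared_errors(errors)
for pair, common in links.items():
    print(pair, "probably share an ancestor that carried:", sorted(common))
```

Only the pairs with a shared error are linked. Witnesses that agree everywhere they are correct get no link at all, which is the asymmetry the method depends on.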

So you trace the shared errors, group the witnesses, and reconstruct the deepest node the evidence can reach. And this is the part that should stop an engineer cold. The method, in the honest words of the philologists who teach it, “only reaches the youngest common ancestor” of the surviving copies — a node they are careful to call the archetype — “which may be later than the authorial text.”

The archetype is not the original. It is the most recent ancestor that all your surviving evidence descends from. Everything that happened between the author's lost autograph and that archetype — every reading the author actually wrote that didn't make it into the common ancestor of the survivors — is simply gone. For many classical works the archetype is a medieval copy made centuries after the author died. That gap, the stretch of history above the archetype, is not “hard to recover.” It is unrecoverable, in principle, from the copies alone. The discipline's maturity is precisely that it stops pretending otherwise. It labels the reconstructed node “archetype,” not “original,” and means the distinction.

You have already built this object. It's your merge-base.

Now look at git merge-base. Its job, in the documentation's own words, is to find “the best common ancestor between two commits” — “the most recent commit reachable from both” branches — so that a three-way merge can use it as the reference point for reconciling two divergent histories.

Read those two definitions side by side. The archetype is the youngest common ancestor of the surviving witnesses. The merge-base is the most recent common ancestor of the diverged branches. These are not analogous objects. They are the same object: the lowest common ancestor in a directed acyclic graph of descent. Stemmatics and Git are computing the same thing, by the same logic, against the same hard ceiling. A three-way merge looks back as far as your most recent shared commit and not one second further. Whatever the two branches were each, separately, trying to express (the unified intent that existed before they forked) lives in the gap above the merge-base, and a three-way merge cannot reach it. It can only blend the two descendants into a third thing that looks like a parent.
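To make "lowest common ancestor in a DAG" concrete, here is a small sketch over a hypothetical commit graph. It is not Git's implementation, only the same definition applied to a toy parent map: a merge base is a common ancestor that is not itself an ancestor of another common ancestor.

```python
# Hypothetical commit graph: each commit maps to its list of parents.
parents = {
    "root": [],
    "c1": ["root"],
    "c2": ["c1"],
    "a1": ["c2"], "a2": ["a1"],  # branch A
    "b1": ["c2"], "b2": ["b1"],  # branch B
}

def ancestors(commit, parents):
    """Return the commit plus everything reachable by following parent links."""
    seen, stack = set(), [commit]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(parents[c])
    return seen

def merge_bases(x, y, parents):
    """Common ancestors that are not ancestors of any other common ancestor."""
    common = ancestors(x, parents) & ancestors(y, parents)
    return {
        c for c in common
        if not any(c != d and c in ancestors(d, parents) for d in common)
    }

bases = merge_bases("a2", "b2", parents)
print(bases)
```

For this graph the only merge base is `c2`. The older shared commits `c1` and `root` are still common ancestors, but the three-way merge never consults them.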

This recurrence isn't a coincidence; trees of descent show up wherever copying-with-variation does. Textual critics noticed decades ago that their stemmata look like evolutionary cladograms, and the crossover became a real research program — quantitative phylogenetics borrowed straight from biology, Bayesian methods used “to infer manuscript transmission history.” The descent-tree is a shape that biology, philology, and version control all independently discovered. What nobody seems to have written down is that distributed systems is sitting inside the same shape, computing archetypes and calling them merge-bases — and, too often, calling them originals.

The error this names

So here is the first concrete payoff, the error that has a name once you have the vocabulary: treating the recovered archetype as the true pre-divergence state.

When you resolve a replica conflict, reconstruct a record from an event log, or merge two long-lived branches, you are doing archetype recovery. The output is real and useful: it is built on the genuine most recent common ancestor of your divergence. But it is not the intended state that existed before the split, any more than a reconstructed archetype is the author's manuscript. If the two replicas drifted because of forty minutes of concurrent writes, the “correct” pre-partition intent (what the system would have converged to with no partition) is information that the replicas, by themselves, do not contain.
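A toy illustration of how a clean merge manufactures a state. This is a per-key three-way merge over plain dicts, written for this example only; the region names and row contents are hypothetical, and `None` is assumed never to be a real value.

```python
def three_way_merge(base, ours, theirs):
    """Per-key three-way merge of dicts.

    A key changed on only one side takes that side's value. A key changed
    differently on both sides is reported as a conflict and left out.
    A missing key is treated as None, so None must not be a real value.
    """
    merged, conflicts = {}, []
    for key in sorted(base.keys() | ours.keys() | theirs.keys()):
        b, o, t = base.get(key), ours.get(key), theirs.get(key)
        if o == t:
            value = o
        elif o == b:
            value = t
        elif t == b:
            value = o
        else:
            conflicts.append(key)
            continue
        if value is not None:
            merged[key] = value
    return merged, conflicts

base = {"status": "open", "owner": "ana"}    # last shared state
east = {"status": "closed", "owner": "ana"}  # region east closed the ticket
west = {"status": "open", "owner": "bo"}     # region west reassigned it

merged, conflicts = three_way_merge(base, east, west)
print(merged, conflicts)
```

The merge reports no conflicts, and the result (closed, owned by bo) is a row that neither region and no earlier moment ever held.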

Textual criticism is brutally clear about the only escape hatch. To assert anything above the archetype, an editor must “impute readings to an authorial text on the basis of external evidence beyond what the direct manuscript evidence reconstructs.” External evidence. A quotation of the lost original in some other author. A translation made from a now-lost copy. Never the surviving manuscripts alone, because the surviving manuscripts define the archetype — they can't see past it.

The distributed-systems translation is exact and immediately actionable. To recover state above your merge-base, you need evidence from outside the diverged replicas: a trusted checkpoint, a write-ahead log that predates the fork, a quorum certificate, a monotonic version vector, an authoritative upstream. If all you have are the divergent copies, the merge-base is your ceiling, and any “original” you claim above it is something you invented and should label as such. The shared-bug version of “community of error” sharpens this into a debugging heuristic: when the same idiosyncratic corruption shows up in two replicas, that's strong evidence they share an ancestor, far stronger than shared correct state, which any two healthy replicas would have anyway. Convergent correctness tells you nothing about lineage. A shared, weird, specific bug tells you where the fork is.
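One of the external witnesses named above, the version vector, can be sketched in a few lines. The assumption is that each replica records a per-node write counter alongside its data; the node names and counts below are hypothetical.

```python
def compare(vv_a, vv_b):
    """Classify two version vectors.

    Returns 'equal', 'a_before_b', 'b_before_a', or 'concurrent'.
    A node missing from a vector is treated as a count of zero.
    """
    nodes = vv_a.keys() | vv_b.keys()
    a_le_b = all(vv_a.get(n, 0) <= vv_b.get(n, 0) for n in nodes)
    b_le_a = all(vv_b.get(n, 0) <= vv_a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "a_before_b"
    if b_le_a:
        return "b_before_a"
    return "concurrent"

checkpoint = {"east": 3, "west": 2}  # trusted state recorded before the partition
east = {"east": 7, "west": 2}        # east kept writing
west = {"east": 3, "west": 5}        # so did west

print(compare(checkpoint, east))  # checkpoint is an ancestor of east
print(compare(east, west))        # neither replica descends from the other
```

The row contents alone could not tell you that the two replicas are concurrent rather than one being stale. The counters can, because they were recorded as the writes happened.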

The deeper trap: sometimes there was never one original

The error above is the recoverable mistake — you wanted the autograph, you settled for the archetype, fine, name it and move on. There is a worse trap, and it's the one that turns a careful engineer into a destroyer of information: assuming a single original exists when it doesn't.

The canonical case study is Hamlet. There are three early texts: Q1 (the so-called “Bad Quarto,” 1603), Q2 (1604–05, the longest version), and the 1623 First Folio. For a long time editors treated this as a reconstruction problem — find the “real” Hamlet behind the corrupt copies. Then they actually compared Q2 and the Folio carefully, and the assumption fell apart. The Folio omits more than two hundred lines present in Q2 — and adds lines that aren't in Q2 at all. This is the detail that breaks the merge. The two texts are not a subset and a superset; each contains material the other lacks. There is no version you can reconstruct that holds all of both, because no such version ever existed. A three-way merge of Q2 and the Folio would produce a Hamlet that Shakespeare never wrote and no theater ever staged — exactly the manufactured-canonical state from that Friday afternoon, in Elizabethan dress.

Modern editors drew the only honest conclusion. Many now hold that “the Second Quarto and Folio are distinct, independent Shakespearean versions that ought never to be combined in an edition.” The scholar Leah Marcus pushed it further with what she called unediting, arguing that even Q1 “deserves serious attention as a stand-alone text,” “rejecting chronological priority and authenticity,” and “embracing a provisional equality between alternative texts.” In the strongest form: the variants “may no longer be discriminated as authentic and corrupted... since all readings have the right of their historicity.” There is no privileged original against which the others are errors. There is a family of legitimate versions. One scholar, Jesus Tronch-Perez, even built “A Synoptic Hamlet” that prints Q2 and the Folio in parallel columns rather than merging them: a multi-version edition that refuses to collapse the divergence. A human CRDT, applied to a divergence four centuries old.

Because that is precisely what a conflict-free replicated data type is for. The entire design premise of CRDTs is that “concurrent updates on different replicas are merged automatically and all replicas converge,” and (the load-bearing verb) they work by “merging values from concurrently modified rows instead of discarding one, as traditional conflict resolution does.” Their merge operations are commutative and associative, and in state-based designs also idempotent, so replicas can receive updates in any order, and more than once, and still converge, because the design treats concurrent divergence as legitimate state to preserve, not a conflict to force into a single winner. A CRDT is the Folio-and-Quarto stance written in code: some divergence is not corruption to be reconciled away. It is two real things that happened, and flattening them into one canonical row destroys information the same way splicing Q2 and the Folio destroys two plays.
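A sketch in the spirit of a multi-value register, one of the simpler CRDT shapes: concurrent writes are kept side by side, and a later write that has seen both replaces them. This is an illustration under simplifying assumptions, not any particular library's API; every name and value is hypothetical.

```python
def freeze(vv):
    """Make a version vector hashable so it can live inside a frozenset."""
    return tuple(sorted(vv.items()))

def dominates(a, b):
    """True if version vector a has seen everything b has, plus something more."""
    nodes = a.keys() | b.keys()
    return (all(a.get(n, 0) >= b.get(n, 0) for n in nodes)
            and any(a.get(n, 0) > b.get(n, 0) for n in nodes))

def merge(x, y):
    """Join two register states: keep every version not superseded by another."""
    candidates = x | y
    return frozenset(
        (value, vv) for value, vv in candidates
        if not any(dominates(dict(other), dict(vv)) for _, other in candidates)
    )

east = frozenset({("ship to Oslo", freeze({"east": 1}))})
west = frozenset({("ship to Lima", freeze({"west": 1}))})
both = merge(east, west)  # concurrent writes: both survive

# A later write made after seeing both versions supersedes them.
later = frozenset({("ship to Rome", freeze({"east": 2, "west": 1}))})
print(sorted(value for value, _ in both))
print(sorted(value for value, _ in merge(both, later)))
```

Because `merge` is a set union followed by dropping superseded versions, it gives the same answer regardless of argument order, grouping, or repetition. The application still has to decide what to do with two surviving values, but nothing was discarded on its behalf.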

Contamination: when the tree was never a tree

There's a final wrinkle that both fields hit, and it's a warning sign worth learning to read. Stemmatics assumes a clean tree — each copy descends from exactly one exemplar. But scribes sometimes copied from two manuscripts at once, a phenomenon called contamination, and the moment they do, the family tree becomes a network. There is no longer a single clean archetype; ancestry has multiple paths, and the elegant genealogical method partly breaks down. (This is also why the phylogenetics crowd reaches for networks, not just trees.)

Git has the identical pathology, and it even has a name for it: the criss-cross merge, where “there can be more than one merge base for a pair of commits.” When two branches have each merged from the other in the past, ancestry is contaminated, and there is no single clean common ancestor — there are several, and the three-way merge has to pick or combine, with the well-known risk of silently producing a wrong result. In both fields, the clean tree is the lucky case, not the default. So when your merge tool reports multiple merge-bases, hear it as the philologist's alarm: the lineage is contaminated; the single-archetype assumption you were relying on does not hold here. Don't trust the auto-merge. You're in network territory.
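The criss-cross case is easy to reproduce on a toy graph. The sketch below uses a hypothetical parent map in which each branch has merged the other's earlier commit, then refuses to proceed when more than one merge base comes back. In a real repository, `git merge-base --all` lists every merge base for a pair of commits.

```python
# Hypothetical criss-cross history: a2 merged b1, and b2 merged a1.
parents = {
    "root": [],
    "a1": ["root"],
    "b1": ["root"],
    "a2": ["a1", "b1"],
    "b2": ["b1", "a1"],
}

def ancestors(commit, parents):
    """The commit itself plus all commits reachable through parent links."""
    seen = {commit}
    for p in parents[commit]:
        seen |= ancestors(p, parents)
    return seen

def merge_bases(x, y, parents):
    """Common ancestors that are not ancestors of another common ancestor."""
    common = ancestors(x, parents) & ancestors(y, parents)
    return {
        c for c in common
        if not any(c in ancestors(d, parents) for d in common if d != c)
    }

def require_single_base(x, y, parents):
    """Return the unique merge base, or raise if the lineage is not a clean tree."""
    bases = merge_bases(x, y, parents)
    if len(bases) != 1:
        raise ValueError(
            f"{len(bases)} merge bases {sorted(bases)}: review this merge by hand"
        )
    return next(iter(bases))

print(sorted(merge_bases("a2", "b2", parents)))
```

Here both `a1` and `b1` qualify and neither is an ancestor of the other, so there is no single archetype to merge against. Calling `require_single_base("a2", "b2", parents)` raises instead of choosing one silently.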

The decision rule, and the one place we're luckier than the scholars

Strip it down to something you can use on Monday. Whenever you are rebuilding canonical state from divergent copies — merging forks, reconciling replicas, reconstructing a record from logs, resolving a split brain — ask two questions, in order.

First: Is there a single original to recover at all, or am I looking at a legitimate family of versions? If concurrent divergence is real and meaningful — two regions, two intents, both valid — then forcing one canonical state is the Hamlet error: you will destroy real information to satisfy a schema's craving for a single row. Reach for CRDT-style merge-don't-discard, or keep the versions in parallel, before you collapse them.

Second, if there is a single intended state: Am I recovering the archetype, or am I claiming the autograph? The merge-base, the reconstructed record, the reconciled replica: that's your archetype, the youngest common ancestor your evidence can reach. It is genuine and it is a ceiling. Anything you assert about the pre-divergence intent above that ceiling has to come from external evidence (a checkpoint, an upstream log, a quorum proof), and if you don't have that evidence, say so. Name the limit. “Merge-base,” not “original.”

And here is the one consolation, the place where the disanalogy runs in our favor. The textual critic's autograph is lost forever; no amount of engineering brings back a manuscript that rotted in the twelfth century. But in a system, the gap above the archetype is only unrecoverable when you've thrown the history away. If you keep a complete, append-only, ordered log of everything that happened — true event sourcing, a hash-linked chain where every prior state is retained and verifiable — then your “autograph” isn't lost at all. The full history is the original, and divergence is just two readers who haven't replayed the same prefix yet. The reason reconstruction is hard is almost always that somebody, somewhere, optimized away the past and kept only the latest snapshot of each replica. The medieval scribes had an excuse: parchment is expensive and copies burn. We are choosing it, every time we treat durable history as overhead.
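A minimal sketch of the kind of history described above: an append-only log where each entry's hash covers the previous entry's hash. It is a generic illustration using only the standard library, not the API of any particular product; the field names and events are assumptions.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry

def entry_hash(prev, event):
    """Hash an event together with the hash of the entry before it."""
    body = json.dumps({"prev": prev, "event": event}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()

def append(log, event):
    """Add an event to the end of the log, linking it to the previous entry."""
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"prev": prev, "event": event, "hash": entry_hash(prev, event)})

def verify(log):
    """Recompute every link; False if any entry was altered or reordered."""
    prev = GENESIS
    for entry in log:
        if entry["prev"] != prev or entry["hash"] != entry_hash(prev, entry["event"]):
            return False
        prev = entry["hash"]
    return True

def common_prefix_length(log_a, log_b):
    """Number of leading entries two replicas agree on: the point of the fork."""
    n = 0
    for a, b in zip(log_a, log_b):
        if a["hash"] != b["hash"]:
            break
        n += 1
    return n

shared = []
append(shared, {"set": "status", "to": "open"})
append(shared, {"set": "owner", "to": "ana"})

east, west = list(shared), list(shared)  # both replicas start from the same history
append(east, {"set": "status", "to": "closed"})
append(west, {"set": "owner", "to": "bo"})

print(verify(east), verify(west), common_prefix_length(east, west))
```

With the history retained, the fork point is found by comparing hashes rather than inferred from the end states, and everything before it can be replayed rather than reconstructed.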

Mature fields earn their maturity by naming their limits precisely. Textual scholars stopped saying “the original” and started saying “the archetype,” and the whole discipline got more honest in a single word. The next time a merge resolves cleanly into a state nobody ever wrote, do them the courtesy of the same precision. Ask whether there was one truth to find or several. And if there was one, call the thing you recovered what it actually is — the deepest point your evidence reaches — and keep enough history that, next time, the evidence reaches all the way back.


Sources

  1. Stemmatics / the genealogical (common-error) method, after Karl Lachmann (1793–1851) — “community of error implies community of origin”; the reconstructed node reaches only the archetype, the youngest common ancestor of the surviving witnesses, “which may be later than the authorial text.”
  2. git merge-base documentation — “the best common ancestor between two commits,” “the most recent commit reachable from both”; and the criss-cross merge case where “there can be more than one merge base for a pair of commits.”
  3. Quantitative phylogenetics applied to manuscript traditions — cladogram-shaped stemmata; Bayesian methods to infer transmission history; networks (not just trees) under contamination.
  4. Hamlet Q1 (1603) / Q2 (1604–05) / First Folio (1623) — the Folio omits 200+ lines present in Q2 and adds lines absent from it; the texts are mutually exclusive, not subset/superset.
  5. Leah Marcus, unediting (“provisional equality between alternative texts”; “all readings have the right of their historicity”); Jesus Tronch-Perez, “A Synoptic Hamlet” (Q2 and Folio in parallel columns, un-merged).
  6. Conflict-free replicated data types (CRDTs) — concurrent updates merged automatically to convergence by “merging values from concurrently modified rows instead of discarding one”; commutative, associative operations that preserve concurrent divergence as legitimate state.
  7. Event sourcing — an append-only, causally-ordered, retained-and-verifiable history as the durable “autograph” that makes the gap above the archetype recoverable.

For agents, the autograph isn't lost — unless you threw the history away.

When two agent replicas diverge, a reconciled merge gives you the archetype: the youngest common ancestor your evidence can reach, not the intent that existed before the fork. The only way past that ceiling is external evidence, meaning a retained, ordered history. Chain of Consciousness is exactly that: an append-only, hash-linked, verifiable log of what an agent decided and did, where every prior state is kept rather than optimized away. Keep it, and the gap above the archetype stops being unrecoverable. The full history is the original, and divergence is just two readers who haven't replayed the same prefix yet.

Hosted Chain of Consciousness · See a verified chain · pip install chain-of-consciousness  ·  npm install chain-of-consciousness