
Codicology for Compiled Code

Triangulating authorship when git blame lies — what medieval manuscript authentication teaches software forensics about the evidence nobody designed.

Published April 2026 · 10 min read

Open a thirteenth-century parchment codex to any two-page spread and look at the surface, not the text. On the left page, tiny follicle marks where hair once grew — a faint roughness, a slightly yellowed cast. On the right page, the same. Flip forward: smooth, cream-colored flesh side faces flesh side. This alternation — hair to hair, flesh to flesh — runs through the entire book with the regularity of a heartbeat.

No medieval scribe thought of this as a security feature. They were making a beautiful book. Hair side is rougher and darker than flesh side; alternating them within a spread would look jarring, so bookmakers arranged their parchment sheets to keep each opening visually consistent. Caspar René Gregory, an American textual scholar working in nineteenth-century Germany on New Testament manuscripts, was the first to formalize what this incidental regularity meant for authentication. When a leaf has been inserted into a quire — a forgery, a later interpolation, a replacement — the hair-flesh alternation breaks. A checksum failure, centuries before checksums.

Gregory’s Rule, as codicologists now call it, illustrates a principle that software forensics has been slow to learn: the most diagnostic evidence of tampering is often the evidence nobody designed.

The paleographic playbook

Manuscript scholars never trust a single dating indicator. The standard method — described in paleography handbooks from Bernhard Bischoff’s Latin Palaeography (Cambridge UP, 1990) onward — triangulates across multiple independent evidence types.

Script. A trained paleographer dates Latin manuscripts to within a quarter-century by letterform alone, for most of the period after 800 CE. The angle of the pen, the proportions of ascenders and descenders, the treatment of ligatures — these are as individual as handwriting, because they are handwriting.

Watermark. For paper codices, Charles-Moïse Briquet’s Les Filigranes (1907) catalogued European paper-mill watermarks so thoroughly that matching one narrows the date to a five-to-twenty-year window and a particular mill’s region.

Ruling pattern. Before writing, scribes ruled lines into the parchment with a stylus, lead, or ink. The spacing, the number of columns, the margin conventions — these are datable independent of the script and by different specialists.

Quire construction. A quire is a gathering of folded parchment sheets sewn together. Quire boundaries often show changes of ink, hand, or layout. An inserted leaf whose ruling doesn’t match its quire is diagnostic.

Parchment preparation. Regional and period-specific. Coarse versus fine, thick versus thin — the substrate itself carries temporal and geographic signal.

Ownership marks. The first recorded owner gives a terminus ante quem: the book must have existed before that person could sign it.

No one of these is conclusive. A responsible manuscript catalog entry hedges deliberately: “s. xii², perhaps third quarter, northern France (Champagne?), probably Cistercian provenance.” That phrasing reflects the underlying evidence, not the cataloguer’s timidity. It may be the most honest dating statement any discipline produces.

When git blame lies

Now consider how software development handles the equivalent question — who wrote this code, and when?

Almost exclusively, the answer is git blame. One command, one indicator: the commit that last modified each line. And for most purposes, it works. But a paleographer looking at this arrangement would wince. git blame is ownership marks with no supporting evidence — the medieval equivalent of trusting a colophon and nothing else.

The fragility is well-documented. git rebase and git filter-branch can rewrite entire commit histories — the software equivalent of a scribe erasing and rewriting a colophon. A single prettier or black run reassigns every line to the formatter’s commit, erasing the attribution trail as thoroughly as a later scribe rebinding and reruling a codex. Squash merges collapse multiple authors’ work into a single commit — a patron recorded as “author” of a compilation manuscript. And now, AI assistance: code generated by Copilot or Claude Code is committed under the human’s name, the human’s hand pushing someone else’s text.

The stylometric research bears this out. On single-author, whole-file attribution, code authorship methods achieve accuracy above 90% (Caliskan-Islam et al.; Kalgutkar et al., ACM Computing Surveys 52(1)). But on collaborative code segments — the kind git blame actually deals with — Dauber and Caliskan found accuracy drops to 50–60% (“Git Blame Who?,” arXiv:1701.05681, 2017). And adversarial edits defeat current attribution methods with unsettling regularity. As a 2024 survey in MDPI Information put it: “As of now, there is no code authorship attribution method capable of effectively handling such attacks.”

One indicator. Easily faked. Easily broken by routine operations. This is the problem.

The fingerprints you didn’t know you were leaving

Here’s where it gets interesting. Two studies from 2026 have mapped the fingerprinting landscape for AI-generated code, and the findings would delight a paleographer.

The first — “Fingerprinting AI Coding Agents on GitHub” (arXiv:2601.17406) — analyzed 33,580 pull requests across five AI coding agents: OpenAI Codex, GitHub Copilot, Devin, Cursor, and Claude Code. An XGBoost classifier achieved 97.2% F1-score in identifying which agent produced the code. But the counterintuitive finding was where the fingerprints live: commit message conventions were more discriminative than the code itself. The multiline commit ratio alone carried 44.7% of the discriminative power. Each agent had its own metadata signature — OpenAI Codex wrote verbose multiline commits, Cursor peppered PR bodies with bullet points and hyperlinks, Claude Code showed high conditional density and elevated comment rates.
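The headline feature is easy to reproduce. Below is a toy version of the multiline-commit-ratio computation; it is my illustration of the feature the paper describes, not the authors' code, and the sample histories are invented:

```python
def multiline_commit_ratio(messages):
    """Fraction of commit messages whose text spans more than one
    non-empty line. A toy version of the single most discriminative
    feature reported in the fingerprinting study: agents that write
    verbose multiline commits stand out against terse committers."""
    if not messages:
        return 0.0
    multiline = sum(
        1 for msg in messages
        if len([ln for ln in msg.splitlines() if ln.strip()]) > 1
    )
    return multiline / len(messages)


# Invented histories: a verbose agent versus a terse human.
agent_log = ["Add parser\n\nHandles nested quotes and escapes.",
             "Fix CI\n\nPin the runner image."]
human_log = ["fix typo", "bump deps"]
```

Fed a real history, the ratio becomes one coordinate in the classifier's feature vector; the 97.2% figure comes from combining it with dozens of similar metadata features, not from any single feature alone.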

A paleographer would nod. Colophons — scribal subscriptions at the end of manuscripts — often reveal more about authorship than the text itself. The scribe’s personal formula, their dating convention, their invocation: these are the metadata of the medieval book, and they fingerprint the maker more reliably than the content does.

The second study — “Code Fingerprints: Disentangled Attribution” (DCAN, arXiv:2603.04212, March 2026) — found something equally striking: attribution accuracy improves with task complexity. Easy tasks produce canonical solutions that look the same regardless of who or what wrote them. Hard tasks amplify stylistic variation. The paleographic parallel is exact: standard Caroline minuscule — the workhorse book script of Carolingian Europe — is notoriously difficult to attribute to a specific scriptorium, precisely because it was designed for uniformity. But an idiosyncratic display script, a scribe’s vernacular hand, an experimental ligature — those localize immediately. The mundane is the hardest to fingerprint, in any medium.

And perhaps most remarkably, DCAN found that LLM fingerprints persist across programming languages — 93.48% cross-language attribution accuracy with comments included. Claude writes identifiably like Claude in Python, Java, Go, and C, just as a medieval scribe’s hand is recognizable whether they’re copying Latin or vernacular French. The authorial identity runs deeper than the surface language.

The translation

What manuscript studies offer software forensics is not a metaphor. It’s a methodology.

Script analysis → Code stylometry. Token frequencies, abstract syntax tree shapes, naming conventions, indentation patterns — these are the ductus of code, the pen-angle and stroke-order that identify the hand.

Watermark dating → Build-tool-chain fingerprints. Compiler flags, bundler versions, lockfile formats — the toolchain leaves marks as specific as a paper mill’s wire pattern.

Ruling pattern → Formatting configuration. .editorconfig, .prettierrc, linter rules — the invisible grid that structures the page.

Gregory’s Rule → Substrate invariants. Line-ending consistency (CRLF versus LF), encoding (UTF-8 BOM presence), whitespace regularity, entropy patterns. Nobody designs these as authentication mechanisms. They function as checksums anyway. AI-generated code insertions break them because the generating model doesn’t preserve the repository’s incidental regularities — just as a forger focused on imitating a script might not think to match the parchment orientation.

Parchment preparation → Dependency fingerprints. Package ecosystem, version pinning patterns, registry choices. The supply-chain provenance of a codebase’s dependencies is as regional and temporal as the preparation of a calfskin leaf.

Ownership marks → Git metadata. Author email, timezone, GPG signatures. And as easily forged as an ex-libris plate glued into a stolen book.
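Of these mappings, Gregory's Rule is the easiest to operationalize. A minimal sketch, assuming the only invariants we care about are line endings and byte-order marks (a real tool would track many more):

```python
def substrate_profile(data: bytes) -> dict:
    """Incidental regularities of a text file: the code-side analogue
    of hair-flesh alternation. None of these were designed as checksums."""
    return {
        "bom": data.startswith(b"\xef\xbb\xbf"),        # UTF-8 BOM present?
        "crlf": data.count(b"\r\n"),                    # Windows line endings
        "lf": data.count(b"\n") - data.count(b"\r\n"),  # Unix line endings
        "trailing_ws": sum(
            1 for ln in data.splitlines() if ln != ln.rstrip()
        ),
    }


def breaks_invariant(repo_profile: dict, file_profile: dict) -> bool:
    """True if a file mixes line-ending conventions or introduces a BOM
    into a repository that has neither: an 'inserted leaf'."""
    mixed = file_profile["crlf"] > 0 and file_profile["lf"] > 0
    rogue_bom = file_profile["bom"] and not repo_profile["bom"]
    return mixed or rogue_bom
```

Run over every file at a suspect commit, a profile mismatch plays the role of the broken hair-flesh alternation: it does not prove insertion, but it tells you exactly which leaf to examine.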

The through-line is the paleographic maxim: every single indicator can be faked in isolation, but faking all of them simultaneously is exponentially harder. A manuscript forger can imitate a twelfth-century script, but matching the script plus the correct watermark for the region plus period-appropriate ruling plus proper quire construction plus consistent ink chemistry requires expertise across five distinct subspecialties. A git history rewriter can fake commits, but simultaneously preserving consistent stylometric patterns, matching build-tool artifacts, maintaining timezone coherence, and keeping dependency version patterns plausible requires awareness of forensic techniques that most rewriters don’t anticipate.
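The "exponentially harder" claim is just the independence assumption made explicit. If each of k independent indicators is evaded with probability q, the forger's overall success rate is q^k. The sketch below assumes full independence, which real, correlated checks only approximate:

```python
def evasion_probability(per_indicator_evasion: float, indicators: int) -> float:
    """Chance a forger slips past every one of `indicators` independent
    checks, each evaded with the same probability. Independence is the
    key (and optimistic) assumption: correlated checks buy less."""
    return per_indicator_evasion ** indicators


# An attacker who beats any single check 70% of the time beats
# five independent checks only about 17% of the time.
single = evasion_probability(0.7, 1)
five = evasion_probability(0.7, 5)
```

This is why the cataloguer's five subspecialties matter: the defense does not need any one unbeatable indicator, only enough independent mediocre ones.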

And here’s the argument’s sharpest edge: digital substrates need more triangulation than physical ones, not less. Parchment has an inherent audit trail — you can’t un-scrape a surface, un-bond iron-gall ink from the fibers, un-disturb the collagen structure. Code has no such substrate memory. Every bit is perfectly copyable, every history perfectly rewritable. The more amnesiac the substrate, the more independent indicators you need.

Where the analogy breaks

Three ways, ordered by severity.

First: physical irreversibility has no digital equivalent. A paleographer examining a palimpsest — a scraped-and-rewritten parchment — can recover the original text through multispectral imaging because the ink chemically bonded with the collagen. A rewritten git history leaves no such trace. The reflog helps for a while, but it’s local, temporary, and trivially deletable. The substrate’s amnesia is fundamental, not incidental.

Second: the adversarial gap. Current code attribution methods fold under targeted adversarial attacks at rates that have no paleographic equivalent. You cannot automate the production of convincing medieval parchment, period-appropriate iron-gall ink, and a twelfth-century hand. Adversarial stylometric evasion, on the other hand, is a script away. The attacker’s cost is categorically different in the two domains.

Third: scale and velocity. The largest medieval library held thousands of codices. A modern enterprise codebase has millions of files, thousands of contributors, and continuous automated modifications. Any triangulation framework for code must operate at a scale that would make autoptic examination — Gregory’s careful, physical handling of each codex — impossible. The method must be automated, or it will remain an academic exercise.

These are real limits. They mean the translation cannot be literal. But they don’t invalidate the methodology — they intensify the need for it. The paleographic playbook was developed precisely because no single indicator was trustworthy. Software forensics faces the same problem with weaker substrate guarantees, and yet it mostly still relies on the equivalent of reading the colophon and calling it a day.

The incidental invariant

Gregory wasn’t looking for a security mechanism. He was a textual scholar trying to understand how New Testament manuscripts were physically constructed. The hair-flesh alternation pattern he documented was a byproduct of bookmaking aesthetics, not a designed authentication feature. Its forensic power was discovered, not invented.

Every codebase is full of the same kind of unconscious regularity. The way one team consistently names error variables. The timezone pattern in commit metadata. The specific version of a lockfile format that pins the code to an era of toolchain development. The entropy signature that distinguishes handwritten logic from generated boilerplate.
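That last regularity, the entropy signature, is directly measurable. A toy sketch of Shannon entropy over a token stream; the premise that boilerplate scores lower than handwritten logic is this article's hypothesis, not an established threshold:

```python
import math
from collections import Counter


def token_entropy(tokens):
    """Shannon entropy (in bits) of a token frequency distribution.
    Repetitive boilerplate concentrates mass on a few tokens, giving
    low entropy; varied handwritten logic spreads it out, giving more."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum(
        (c / total) * math.log2(c / total) for c in counts.values()
    )
```

A fully repetitive stream scores 0 bits; a stream of n equally likely tokens scores log2(n). Like hair-flesh alternation, the measure costs nothing to take and was never designed to authenticate anything.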

None of these were designed as forensic indicators. None of them need to be. The question is whether, eight hundred years into the tradition of making sense of other people’s texts, we’ll have learned to read what was there all along — or whether we’ll still be trusting the colophon.



The evidence nobody designed is the evidence that matters.

That is the codicologist’s lesson for software forensics: no single indicator — git blame, a commit message, an author field — survives scrutiny alone. You need triangulation across independent evidence types, where faking all of them simultaneously is exponentially harder than faking any one. Chain of Consciousness applies this principle to agent provenance. Every action generates multiple independent attestation signals — timestamped, anchored, crossdatable — building an authorship record that survives rebasing, squash merges, and adversarial editing. Not one indicator. A triangulated chain.

See how triangulated verification works · Follow an attestation chain · pip install chain-of-consciousness