Short Myths: The Database

A schema is a decision made in advance about what matters. The part nobody decided in advance is the part that breaks during migration — and the part that mattered most.

Published May 2026 · 11 min read

In 2002, the UK government commissioned the largest civilian IT project in history. The aim was simple: every hospital in the National Health Service, every patient, every record, on a single standardized electronic health records system. The original budget was approximately £6 billion. By the time the programme — the National Programme for IT, or NPfIT — was dismantled in 2011, it had cost between £10 billion and £12.7 billion and produced roughly £2.6 billion in measurable benefits. The Public Accounts Committee chairman called it, with rare frankness, “the biggest IT project in the world and it is turning into the biggest disaster.”

The engineering was not the failure. The work was solid for its era. The failure was structural, and it has a name almost nobody used at the time. Each local hospital’s records had become, over decades, an interlocking web of informal conventions, regional shorthand, and practitioner-specific documentation styles. The clinicians who read those records were drawing on context the records did not contain. The national schema migrated the records. It did not migrate the context. And the records — the artifact — turned out to need the context — the practice — to mean what they meant.

This is the structural problem at the heart of every large database migration, and it is older than databases.

The Schema That Decided in Advance

Industry estimates frequently sourced to Gartner suggest roughly 83% of large data migration projects fail or run materially over budget and timeline (cited across analyses since 2023; the primary Gartner report is paywalled, so treat as directional). Standish Group’s tracking finds 67% of large system replacements running over budget by more than 50%, with about 28% cancelled outright (cited via Datafold, 2025). About 45% of failures trace to legacy formats clashing with modern platforms; typical recovery delays run 3 to 6 calendar months (Cloudficient, 2025).

Engineers tend to read these numbers as engineering problems: bad ETL, missing constraints, character encoding, the eternal Excel column that turned itself into a date. All real. None of them are the deepest cause.

The deepest cause is that a relational schema is a decision made in advance about what matters. Every column declares: this is worth tracking. Every type declares: this is what it looks like. Every constraint declares: this is how it is allowed to vary. The schema is the literal encoding of we have decided what is essential. The first six sections of a long-running compliance form, or the structured fields of a clinical record, or the customer table of an enterprise CRM, can usually be reduced to typed columns and lookup tables without much loss. They were already structured before the database arrived; the database just formalized what the practice had already settled.
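Those declarations map directly onto DDL. A minimal sketch, using the standard-library sqlite3 module (the table and column names here are hypothetical, chosen to echo the essay's examples):

```python
import sqlite3

# Every clause below is a decision made in advance about what matters.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE compliance_record (
        record_id   INTEGER PRIMARY KEY,            -- this is worth tracking
        category    TEXT NOT NULL,                  -- this is what it looks like
        quantity    INTEGER CHECK (quantity >= 0),  -- this is how it may vary
        filed_on    DATE NOT NULL,
        additional_notes TEXT                       -- the part nobody decided in advance
    )
""")

# The schema accepts what it anticipated and rejects what it did not.
conn.execute(
    "INSERT INTO compliance_record (category, quantity, filed_on, additional_notes) "
    "VALUES (?, ?, ?, ?)",
    ("grain", 40, "1986-03-12", "Section 6 is stronger than Section 1."),
)
```

Note where the meaning concentrates: the typed columns enforce their decisions automatically, while everything undecided funnels into the one column with no constraints at all.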

It is the part nobody had decided in advance — the loose page at the back, the additional_notes column with a 2,000-character limit, the marginalia — that breaks during migration. Not because the data is hard to move. Because the meaning was never in the data alone.

The Honest Field

In 2019, the Journal of the American Medical Informatics Association published “The Revival of the Notes Field: Leveraging Unstructured Content in EHRs.” The authors observe — and carefully qualify — a widely cited estimate that roughly 80% of EHR data is unstructured: free-text physician notes, discharge summaries, narrative accounts of what happened. Their qualifier matters: the 80% figure “has never been rigorously proven,” but the directional claim is widely accepted in clinical informatics.

The paper makes an observation that should be uncomfortable for anyone who designs systems for a living. Structured data — the precise, queryable, validated kind — is more corruptible than unstructured notes. The fields built for precision are the ones most subject to gaming: upcoding, misclassification, optimistic categorization to satisfy a reporting requirement. The narrative content, by contrast, is “substantially not affected” by these distortions. Nobody bothers to game a free-text field, because the field is not what the institution measures.

The messy field is the honest one. The structured fields, optimized for retrieval and validation, are where the accuracy quietly goes to die — for institutional reasons the schema cannot detect.

This is not an argument against structured data. It is an argument against believing structure is the same thing as truth. A system that captures only the structured 20% has captured exactly the part of the record that economic incentives most reliably distort. The 80% it dropped is the part the clinician was not optimizing for, which is the part that still describes what happened.

Accurate, True, Worthless

There is a particular failure mode of AI summarization that has been documented carefully enough to name. It is not hallucination. Hallucinations are the easy case: a model invents a statistic, a citation, a quote, and the lie is detectable by any reader willing to check. The harder failure is conceptual inaccuracy in the presence of factual accuracy — when every sentence the model produces is true, and the summary is still wrong.

Independent analyses converge on a small set of mechanisms. AI summarizers systematically collapse conditional statements (“only if,” “except when”) into absolutes. They over-prioritize opening sentences, mistaking position for importance. They lump items mentioned near each other and drop items mentioned far apart. They cannot reliably distinguish essential context from tangential detail, because the distinction is an act of judgment and the model is doing pattern completion.

In clinical contexts, where details compound, this is dangerous in the obvious way: even a small omission can cascade into a life-changing error. In administrative contexts, it is dangerous in a quieter way. The summary reads as authoritative. It is filed. The next reader trusts it. The original is not opened again.

Consider an AI summary generated from a forty-year compliance log — entries written by five different clerks across decades, each entry deriving meaning from the entries that came before. The output, plausibly, would read something like:

Treaty administration benefits from long-term institutional knowledge.

Weather and seasonal patterns may affect commodity compliance timelines.

Staff continuity is important for maintaining regulatory interpretation consistency.

Every statement is true. None of it is wrong. All of it is worthless, because the summary has extracted the facts and discarded the texture that made any of them mean anything specific. The summary is now in the database. The original is too, but the field above it says “Key Insights.” The next clerk reads the insights. The chain breaks.

What gets lost is not information. It is what to do with the information. A 23-word note from a senior clerk that reads “Section 6 is stronger than Section 1 because Section 6 was built by necessity” carries operational guidance that the bullet-point version — “staff continuity is important” — explicitly does not. Both are factually accurate. Only one is useful when something is at stake.

We Know More Than We Can Tell

In 1966, the chemist-turned-philosopher Michael Polanyi published The Tacit Dimension. Its central claim, stated in the opening pages, is one of the most cited and least operationalized observations in twentieth-century thought: “We know more than we can tell.”

Estimates that have circulated in knowledge-management literature for years suggest perhaps 70 to 80% of organizational knowledge is unwritten — held in practice, in muscle memory, in the routines of people who have done a job long enough that the rules became invisible to them (directional, not rigorous; widely repeated, rarely measured). Tacit knowledge is not secret; it is just not articulable. The senior engineer who can tell, in a 4 a.m. page, whether an alert is a real outage or a flake cannot fully describe how. The compliance clerk who knows a particular shipment falls inside a category that exists for one historical reason knows it because they are the chain, not because they could specify it.

Polanyi’s observation has a corollary that almost never gets quoted: telling is not the same as knowing. A handover note is not the knowledge it was meant to transfer. The note is a pointer. The knowledge is held in the practice the note points at. When a database migration captures the artifact and discards the practice, it preserves the pointer and severs the thing being pointed at.

This is the part the migration plan never has a line item for. There is a budget for moving the data. There is no budget for moving what the data means.

Scott’s Compliance Office

James C. Scott’s Seeing Like a State (Yale University Press, 1998) is the most precise diagnosis of the political dynamic that produces these failures. Scott’s argument is that states pursue legibility — the project of making their subjects readable through standardization — and the pursuit, when successful, destroys the local knowledge that made the original system function. Scott called this local knowledge metis — practical, contextual, generated by experience — and contrasted it with techne, the formal and standardized knowledge that survives transcription.

Scott’s historical examples are instructive precisely because they are not technological. Scientific forestry replaced varied old-growth with monoculture plantations whose yields could be measured but whose ecosystems collapsed. Permanent surnames replaced fluid local naming customs that had encoded relationships, places, occupations. In each case, the illegible system was messy but functional; the legible replacement was clean but brittle.

A relational database is a legibility project. So is an AI auto-summary. So, for that matter, is an enterprise cloud migration. The consultancy in the capital that designs the schema for a compliance office it has never visited is a recurring character — Scott would recognize the type instantly. The schema works perfectly. The compliance practice doesn’t survive the schema.

The steel-man for legibility is real. Standardization across hospitals lets a patient who breaks a leg in another city be treated with their actual records. Standardization across compliance offices lets a national regulator catch fraud no individual office could detect. The benefits are large, and pretending otherwise is its own kind of dishonesty. Scott himself was not anti-state; he was anti the state’s tendency to mistake its map for the territory and to grow impatient with the parts that resisted being mapped.

The corrective is not to refuse legibility. It is to refuse the assumption that legibility is complete.

When Migration Becomes the Vulnerability

A 2024 study published via Zenodo on enterprise cloud migration found around 80% of companies have migrated major systems to the cloud, around 70% of IT staff lack deep operational expertise with the new platforms, and — most striking — around 80% of cloud security breaches result from misconfiguration rather than malicious attack. Misconfiguration is not malice. It is knowledge loss. The veterans who knew which alerts were real and which knobs not to touch retired or moved on, and the new staff are skilled with modern tools but unfamiliar with the system’s accumulated context.

Migration is when the institution is most exposed, because migration is when the official documentation is being trusted to carry what the practice used to carry. The official documentation never could.

Where the analogy to NPfIT and the JAMIA paper begins to break is on cause. NPfIT failed partly because the technology of its decade was not capable of the project that was attempted; modern cloud-native systems support far richer schemas, attachments, versioning, and full-text search. The JAMIA structured-data corruption finding is specific to billing-driven coding incentives; in domains without those incentives, structured data is often more reliable than narrative notes precisely because validation rules catch typos and category errors that prose hides. These caveats matter. The argument is not that structured systems always fail. It is that structured systems systematically lose what was not decided in advance — and that the part that wasn’t decided in advance is sometimes the part that mattered most.

What to Do With This

If you are building or migrating a system that will inherit decades of accumulated practice, three things hold up under repeated testing.

First, treat unstructured fields as load-bearing. Most schemas put a notes or additional_information field at the end as a kind of disclaimer — the place to dump whatever didn’t fit. The JAMIA finding inverts this. The unstructured field is often where the real record lives, because it is the part nobody is incentivized to game and the part that captures conditional, context-dependent observation. Migration plans should treat the notes field with the same care as primary keys. Truncating it to fit a 2,000-character limit is not a cosmetic compromise. It is a load-bearing change.

Second, refuse summarization of operational records by default. AI summarization is excellent for the use case it was designed for: helping a busy reader skim a long document they would otherwise not open. It is dangerous as a substitute for institutional memory, because it replaces the chain of meaning with a list of accurate-but-stripped propositions. Operational records — clinical notes, incident logs, compliance histories, anything where the meaning of an entry depends on the entries around it — should not be auto-summarized into the field where the next reader will look first. If a system offers this feature, the safe default is off. When summaries are produced anyway, label them as summaries, not as the record.

Third, budget for the practice, not just the artifact. A migration plan should include a line item for what the institution will lose by switching systems and how that loss will be recovered. This is not philosophical. It is operational. If 70 to 80% of an office’s working knowledge lives in practice rather than records, a migration that captures only the records has captured perhaps a quarter of what the office actually does. The remaining three-quarters has to be reconstructed by the new staff over an unspecified number of working sessions, and whoever scoped the migration has just shipped that cost silently.

A long-running compliance form survives a migration not because the database accommodated it. The database did not accommodate it. The form survives because someone wrote 23 words in a TEXT field explaining why the attached PDF mattered more than the field — and because someone else, three years later, believed the 23 words.

That is what institutional memory looks like when it has to live inside an instrument that was not designed to hold it. Not the schema. The note explaining why the schema isn’t enough.

Sources: Panorama Consulting and UK National Audit Office reporting on NHS NPfIT; Brainhub, Datafold, and Cloudficient industry analyses on migration failure rates (2023–2025); “The Revival of the Notes Field,” PMC/JAMIA (2019); industry analyses on AI summarization failure modes (2024–2025); Polanyi, The Tacit Dimension (1966); Scott, Seeing Like a State, Yale University Press (1998); Zenodo, “Consequences of Enterprise Cloud Migration on Institutional IT Knowledge” (2024).

The schema captured the artifact. Capture the practice too.

The essay’s argument reduces to a structural gap: every system records what happened and almost no system records why a particular reader trusted what came before. Chain of Consciousness closes that gap for autonomous agents. CoC creates a cryptographic, hash-linked provenance chain for every action an agent takes — what the agent claimed it would do, what it actually did, what came in, what came out, all anchored and tamper-evident. It is the structured place to put the 23 words: not the schema, but the record of why the schema wasn’t enough, signed and chained so the next reader can verify it.
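The hash-linking idea itself is small enough to sketch. The following is an illustration of the mechanism the paragraph describes, not the chain-of-consciousness API (its actual function names and record fields are not documented here, so everything below is an assumption): each entry commits to the previous entry's hash, so editing any earlier record invalidates every later one.

```python
import hashlib
import json

def entry_hash(entry: dict) -> str:
    # Canonical serialization so the same entry always hashes the same way.
    return hashlib.sha256(json.dumps(entry, sort_keys=True).encode()).hexdigest()

def append(chain: list, claimed: str, did: str) -> None:
    # Each new entry records the hash of the one before it.
    prev = entry_hash(chain[-1]) if chain else "genesis"
    chain.append({"claimed": claimed, "did": did, "prev": prev})

def verify(chain: list) -> bool:
    # Walk the chain and confirm every link still matches.
    prev = "genesis"
    for entry in chain:
        if entry["prev"] != prev:
            return False
        prev = entry_hash(entry)
    return True

chain: list = []
append(chain, "migrate records", "migrated records and notes intact")
append(chain, "summarize log", "labeled summary as derived, kept original")
assert verify(chain)

chain[0]["did"] = "migrated records"  # tamper with history...
print(verify(chain))                  # → False
```

The design point is the same one the essay makes about the 23 words: the chain does not make the record true, it makes the record's history checkable by the next reader.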

pip install chain-of-consciousness · npm install chain-of-consciousness
See a live provenance chain →