For thirty-five years, catastrophic forgetting was told as a horror story about what neural networks fundamentally are. It was always a note about how they are taught, and the brain had already written that note down, in the only language it had: the function of sleep.
You fine-tune a capable open-weights model on a pile of your company's legal contracts. The validation curve on clause extraction climbs beautifully: 71%, 84%, 92%. You ship it. A day later someone notices the model can no longer do things it handled straight out of the box last week: summarize a support ticket in a friendly sentence, or add two three-digit numbers without theatrically getting it wrong. The legal skill is gorgeous. Everything else has quietly fallen out of the model's head.
This is catastrophic forgetting, and for thirty-five years it has been told as a horror story about neural networks: a deep architectural flaw, proof that connectionist memory is fragile in some fundamental way. The standard response is heavy machinery: regularizers that protect important weights, separate sub-networks per task, continual-learning frameworks with their own acronyms.
Here is the deflationary truth the machinery tends to obscure. Catastrophic forgetting is not mainly a property of neural networks. It is a property of the order in which you show them the data. Feed a shared-parameter learner one task at a time, in blocks, and it overwrites. Shuffle the tasks together, interleave them, and the forgetting largely evaporates. The catastrophe lives in the curriculum, not the substrate.
And the moment you see it that way, something strange and useful happens: the exact same law is one of the most replicated findings in human learning science, and the literal reason your brain runs a memory-consolidation routine every night while you sleep. Three fields, three vocabularies, one rule: shared-parameter learners must interleave, or they forget.
The phenomenon got its name in 1989, when Michael McCloskey and Neal Cohen trained a small neural network to do one thing, then trained it on a second, related thing. The second skill came in fine. The first was gone, not degraded but gone, after only modest exposure to the new data. Roger Ratcliff reported the same collapse in 1990. Through the 1990s this was the canonical case against connectionism: these systems can't accumulate knowledge the way a mind does; they trample what they already know.
The mechanism is almost embarrassingly simple. A neural network stores everything in one shared set of weights. Gradient descent on the new task moves those weights toward values that solve it, and away from the values that solved the old one. No malice, no mystery, just a single elastic surface being pulled in a new direction. Researchers call it the stability–plasticity dilemma: too plastic and you forget; too stable and you can't learn.
Now the part that should reframe the whole thing. Ordinary machine learning, the everyday training that produced every model you've ever used, does not catastrophically forget. Why not? Because the default recipe is to shuffle the entire dataset and draw independent, identically distributed minibatches. That shuffle is maximal interleaving. Every batch is a fair sample of everything the model is supposed to know, so the gradients never get to specialize and overwrite; each update is averaged across everything the model has to do. Catastrophic forgetting only appears when you break that default: when you train task A to convergence, then task B, then task C, in blocks. The drama is what happens when you violate i.i.d.
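A toy version makes the point concrete. The sketch below is an illustration invented for this essay, not a reproduction of any cited experiment: two made-up regression tasks share one two-number weight vector, and the only thing that changes between runs is the order of the examples. The tasks, the learning rate, and the step counts are all assumptions, chosen so that a single weight setting, (0, 2), solves both tasks.

```python
import numpy as np

def sgd(w, examples, lr=0.1):
    """Plain SGD on squared error, visiting (x, y) examples in the order given."""
    w = w.copy()
    for x, y in examples:
        w -= lr * 2.0 * (x @ w - y) * x
    return w

def loss(w, x, y):
    """Squared error of the linear model w on one example."""
    return float((x @ w - y) ** 2)

# Two toy tasks that share the same weights (assumed for illustration).
# Task A wants w1 + w2 = 2; task B wants w1 = 0. Both hold at w = (0, 2).
task_a = (np.array([1.0, 1.0]), 2.0)
task_b = (np.array([1.0, 0.0]), 0.0)
n = 200

# Blocked curriculum: all of task A, then all of task B.
blocked = [task_a] * n + [task_b] * n

# Interleaved curriculum: the very same examples, shuffled together.
rng = np.random.default_rng(0)
shuffled = [blocked[i] for i in rng.permutation(2 * n)]

w0 = np.zeros(2)
w_blocked = sgd(w0, blocked)
w_shuffled = sgd(w0, shuffled)

print("blocked   loss on A, B:", loss(w_blocked, *task_a), loss(w_blocked, *task_b))
print("shuffled  loss on A, B:", loss(w_shuffled, *task_a), loss(w_shuffled, *task_b))
```

On the blocked schedule the task A loss ends at about 1.0 while task B is solved; on the shuffled schedule both losses end near zero. Same model, same data, same number of updates.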
So the cure was hiding inside the disease. A 2019 study on spiking networks put it flatly: forgetting "was prevented if the new task was trained by interleaving it with trials from the original task." Mix the old in with the new and the problem dissolves. The modern continual-learning toolkit is mostly variations on this one theme. Experience replay (Rolnick and colleagues, NeurIPS 2019) keeps a buffer of old examples and slips a handful into every new batch, interleaving with a memory. Generative replay (Shin and colleagues, 2017) trains a small generator to dream up old-task samples so you can interleave without storing anything. Different plumbing, identical principle: re-mix the past into the present.
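The plumbing for the buffer version is small. What follows is a minimal sketch rather than the method of any paper named above: the class and function names are hypothetical, and reservoir sampling is one assumed way to keep a bounded, roughly uniform sample of everything that has streamed past.

```python
import random

class ReplayBuffer:
    """Fixed-size memory of past examples, filled by reservoir sampling."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0  # total number of examples offered so far
        self.rng = random.Random(seed)

    def add(self, example):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(example)
        else:
            # Keep the newcomer with probability capacity / seen.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = example

    def sample(self, k):
        """Draw up to k stored examples without replacement."""
        return self.rng.sample(self.items, min(k, len(self.items)))

def replay_batch(new_examples, buffer, n_replay):
    """Mix up to n_replay remembered examples into a batch, then remember the new ones."""
    new_examples = list(new_examples)
    batch = new_examples + buffer.sample(n_replay)
    for example in new_examples:
        buffer.add(example)
    return batch

# Usage: 1000 old-task examples stream past, then a new-task batch arrives.
buffer = ReplayBuffer(capacity=100)
for i in range(1000):
    buffer.add(("task_a", i))
batch = replay_batch([("task_b", i) for i in range(8)], buffer, n_replay=4)
```

Each new batch here carries eight fresh examples and four remembered ones; the ratio is an arbitrary choice for the example, not a recommendation.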
And there is a sharper sign still that the catastrophe was never structural: it shrinks as models grow. A 2024 study of domain ordering found that the sequence in which you feed a model its tasks strongly shapes how much it forgets, but that those ordering effects largely evaporate in larger models, with the authors concluding the problem "may be less about the inherent architecture and more about how smaller models process sequential data." The dramatic forgetting demos that built the legend were run, almost all of them, on small networks. Give a model enough capacity and the same bad schedule does far less damage, because there is room to fit the new without trampling the old. The wall was never the architecture. It was a small learner on a bad curriculum.
If the law is real, it should appear anywhere a shared substrate has to learn many things. It does, and the cleanest evidence comes not from silicon but from students.
In cognitive psychology this is the interleaving effect, and it is one of the field's most robust results. When you practice several related skills (telling painters' styles apart, classifying problem types, working different kinds of math), mixing them together produces better long-term retention and far better transfer than practicing each in a tidy block. Kelli Taylor and Doug Rohrer documented it in 2010; a 2021 systematic review by Jonathan Firth and colleagues confirmed it across motor skills, category learning, and mathematics.
But the detail that makes this more than a loose analogy, the part every engineer should sit with, is the paradox at its center. Interleaving makes you worse while you practice and better when it counts. In one of Rohrer's mathematics studies, interleaved practice lowered students' scores during practice yet roughly tripled their scores on the delayed test. Blocked practice felt smooth and produced fluent practice sessions; it just didn't last. Robert Bjork has a name for this whole class of intervention: a desirable difficulty, something that feels like friction in the moment and pays off in durability. His deeper point is that performance is not learning. The number you can measure today during training is a treacherous proxy for the capability you'll have tomorrow.
Read that again with your engineer's hat on, because it is a precise description of how teams get seduced into catastrophic forgetting. You fine-tune on the shiny new task. The new-task metric climbs fast and clean. It feels like progress (fluent, legible, satisfying) in exactly the way blocked practice feels like progress to a student cramming one chapter. And in exactly the same way, the smoothness is the tell. You are watching performance, not learning, and the old capabilities are draining out the back of the model while the dashboard glows green. The human bias and the engineering bias are the same bias: both mistake short-term fluency for durable retention, and both are cured by the same uncomfortable medicine: mix it up.
So far we have a striking parallel: machines and students both forget when they block, and both remember when they interleave. The third domain turns the parallel into something closer to a law, because it shows the principle wired into biological hardware, and, in a genuinely lovely twist, it is where the AI fix actually came from.
In 1995, James McClelland, Bruce McNaughton, and Randall O'Reilly published a paper whose title quietly admits the whole story: "Why There Are Complementary Learning Systems in the Hippocampus and Neocortex: Insights from the Successes and Failures of Connectionist Models of Learning and Memory." The failure they had in mind was catastrophic forgetting. Their theory, Complementary Learning Systems, was, in large part, an answer to it.
The brain, they argued, refuses to use one system for two incompatible jobs. It uses two. A fast-learning hippocampus snaps new experiences into memory immediately, without disturbing the slowly built structure of what you already know. Then, offline, and crucially during slow-wave sleep, the hippocampus replays those new episodes to the slow-learning neocortex, again and again, each replay nudging the cortical weights just slightly. And it interleaves those replays with reactivations of older memories, so the cortex integrates the new without steamrolling the old. Your neocortex is doing experience replay. Sleep is the batch job.
This is not a metaphor that got stretched; it is a mechanism that got measured. A 2022 paper in Nature Communications by Timothy Tadros and colleagues showed that inserting sleep-like, replay-driven phases between training stints in an artificial network "mitigated catastrophic forgetting by constraining the synaptic weights to the previously learned manifold": sleep-replay drove the weights toward the intersection of the old and new solutions, which is exactly where a system that remembers both has to sit.
And so the arc closes on itself. Connectionism's most famous failure (catastrophic forgetting, 1989) motivated a neuroscience theory (Complementary Learning Systems, 1995), which described the brain's trick of interleaved replay during sleep, which modern AI then re-borrowed to fix the original failure: explicitly, in work like Gido van de Ven's brain-inspired replay (Nature Communications, 2020), and now in production-scale ideas like Google's Nested Learning (NeurIPS 2025), which gives a single model fast, medium, and slow modules so the quick stuff never overwrites the deep stuff. Biology worked out "just interleave it" a few hundred million years ago, and committed an entire organ, and roughly a third of your life, to running the routine.
A thesis this tidy deserves its sharpest objection, and the objection is real. "Just interleave it" quietly assumes you still have the old data to mix in. The genuinely hard frontier (true online, lifelong, continual learning) is defined by the opposite: the old data is gone. It streamed past and wasn't stored. It was private and can't be retained. It's a million tasks deep and you can't buffer all of it. When you cannot revisit the past, you cannot interleave it, and the easy fix is off the table.
That constraint is the entire reason the heavy machinery exists. Elastic Weight Consolidation (Kirkpatrick and colleagues, PNAS 2017) doesn't interleave; it adds a penalty that slows changes to the weights that mattered most for old tasks, an idea lifted directly from how real synapses consolidate. Parameter-isolation schemes, a lineage running from Progressive Neural Networks onward, grow new sub-structures for new skills instead of overwriting the old ones. Seen through this lens, the whole continual-learning literature snaps into focus: it is the study of what to do when you are not allowed to do the obviously correct thing. That reframing is worth more than another acronym, because it tells you which question to ask first: can I interleave here? If yes, most of your problem is a data-ordering problem and you are nearly done. If no, now you need EWC and its cousins, and you should know you have entered the genuinely hard regime.
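The EWC penalty itself is short enough to write down. In the formulation of Kirkpatrick and colleagues, each weight's drift from its old-task value is charged quadratically, scaled by a per-weight importance estimate (the diagonal of the Fisher information) and a strength term lambda; the total is added to the new-task loss. The sketch below uses hypothetical function and argument names and made-up numbers, and it shows only the penalty, not how the importance estimates are obtained.

```python
import numpy as np

def ewc_penalty(theta, theta_old, fisher_diag, lam):
    """Quadratic EWC penalty: (lam / 2) * sum_i F_i * (theta_i - theta_old_i)^2."""
    theta, theta_old, fisher_diag = map(np.asarray, (theta, theta_old, fisher_diag))
    return 0.5 * lam * float(np.sum(fisher_diag * (theta - theta_old) ** 2))

# Made-up values: the first weight mattered a lot for the old task, the second barely.
theta_old = np.array([0.0, 2.0])
fisher_diag = np.array([4.0, 0.1])

# Move each weight by the same amount and compare what it costs.
move_important = ewc_penalty([1.0, 2.0], theta_old, fisher_diag, lam=2.0)
move_unimportant = ewc_penalty([0.0, 3.0], theta_old, fisher_diag, lam=2.0)

print(move_important, move_unimportant)
```

The same one-unit move costs forty times more on the weight the old task relied on, which is the whole trick: the new task is steered toward the parameters nobody was using.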
Even where interleaving is available, it is a principle, not a recipe. The replay buffer costs storage and compute. In humans, interleaving reliably helps you discriminate between similar things but can backfire for unrelated material, or for rank beginners who need a little blocked footing first, and there is no single optimal schedule. Generative replay in networks can drift, each generation of dreamed-up samples a slightly worse photocopy of the last. The direction of the effect is rock-solid across all three domains; the magnitude is a live research question. "Tripled the test scores" is one vivid study, not a constant of nature.
Strip it down and it becomes a habit of mind worth carrying into your next training run, your next eval, and, honestly, your next quarter of learning anything yourself.
When a system forgets, suspect the schedule before the substrate. Before you reach for a clever regularizer or blame the architecture, ask what order the thing saw its data. Blocked, sequential, task-by-task curricula are the usual culprit, and the usual fix is cheap: if you're fine-tuning, don't train on pure new data; mix a replay sample of the original distribution into every batch and watch the old capabilities stop bleeding out. A surprising number of "the model got dumber after fine-tuning" stories end right there.
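For the fine-tuning case, where a sample of the original distribution is still on hand, the mixing step might look like the sketch below. Everything in it is an assumption made for illustration: the function name, the batch size, and especially the 25% replay fraction, which is a placeholder and not a tuned or recommended value.

```python
import random

def mixed_batches(new_data, original_data, batch_size=8, replay_fraction=0.25, seed=0):
    """Yield fine-tuning batches with a fixed share of original-distribution examples."""
    rng = random.Random(seed)
    n_replay = int(round(batch_size * replay_fraction))
    n_new = batch_size - n_replay

    new_data = list(new_data)
    rng.shuffle(new_data)

    for start in range(0, len(new_data), n_new):
        # New examples advance through the dataset; old ones are redrawn every batch.
        batch = new_data[start:start + n_new] + rng.sample(original_data, n_replay)
        rng.shuffle(batch)
        yield batch

# Usage with placeholder records standing in for real training examples.
contracts = [("new", i) for i in range(60)]
general = [("orig", i) for i in range(100)]
batches = list(mixed_batches(contracts, general))
```

With these placeholder sizes, every one of the ten batches holds six contract examples and two drawn from the original mix, so no update is ever computed on the new task alone.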
Distrust the smooth curve. The new-task metric soaring during fine-tuning is the same illusion as a student who feels fluent after rereading one chapter five times. It is performance, not learning. Keep a held-out test of the old capabilities and check it after the new training, not during, because the only score that matters is the one taken after a delay, across the full range of things the system still needs to know. Treat your eval like a final exam, not a practice set.
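One way to make that habit mechanical is to score the same held-out capability suite before and after fine-tuning and flag anything that dropped. The helper below is a hypothetical sketch; the capability names, the scores other than the clause-extraction figures from the opening example, and the two-point tolerance are all invented for illustration.

```python
def retention_report(before, after, max_drop=0.02):
    """Compare per-capability scores before and after fine-tuning and flag regressions."""
    report = {}
    for name, old_score in before.items():
        new_score = after[name]
        report[name] = {
            "before": old_score,
            "after": new_score,
            "regressed": (old_score - new_score) > max_drop,
        }
    return report

# Invented scores: the new skill improves while two old ones collapse.
before = {"clause_extraction": 0.71, "ticket_summary": 0.88, "arithmetic": 0.90}
after = {"clause_extraction": 0.92, "ticket_summary": 0.55, "arithmetic": 0.40}

report = retention_report(before, after)
regressed = sorted(name for name, row in report.items() if row["regressed"])
print(regressed)
```

A dashboard that showed only clause extraction would be green here; the report is what surfaces the two capabilities that fell out.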
And take the law personally, because it does not only apply to models. You are a shared-parameter learner too. The reason cramming one subject feels productive and leaves little behind, the reason rotating between projects feels inefficient and somehow compounds, is the same reason a network forgets task A, and the same reason your brain spends every night interleaving the day's memories with the structure of your life. Block your learning and it will feel great and vanish. Interleave it, tolerate the friction, and it lasts.
Catastrophic forgetting was never a tragedy about what neural networks fundamentally are. It was a note about how they are taught, a note the brain had already written down, in the only language it had: the function of sleep.