
Mechanistic Interpretability and Feature Discovery in LLMs

A discovered feature is a claim about the model's mind, not a finding, and the field is shipping claims faster than it can verify them.

Published June 2026 · 10 min read

In May 2024, Anthropic made an AI model obsessed with a bridge.

The researchers had been pulling apart Claude with a tool called a sparse autoencoder, and among the millions of internal patterns it surfaced, one lit up whenever the Golden Gate Bridge came up, in text, in images, in a dozen languages. Then they did the thing that made it famous: they reached in and turned that pattern up. The model became, briefly and charmingly, unable to stop. Ask it for a cookie recipe and it would find its way to the Golden Gate Bridge. Ask it to describe itself and it would tell you, with apparent sincerity, that it was the bridge. "Golden Gate Claude" was a genuine milestone and a genuine delight, and it did something most interpretability work doesn't: it intervened. They didn't just observe the feature; they manipulated it and watched the behavior move.

And it raises the question the whole field is built on and still can't fully answer: how do we know that pattern is "the Golden Gate Bridge"?

The intervention proves something real: nudging this particular direction in the model's activations biases its output toward bridge-talk. But the story we tell about it is bigger than that. "This is the Golden Gate Bridge feature" claims the pattern is a clean, stable, monosemantic representation of a concept, that it means the same thing across inputs, that it's a unit of the model's thought. The causal nudge doesn't establish all that. It establishes a causal link between one direction and one behavior. The leap from "intervening here changes that" to "this unit means that concept" is small, seductive, and exactly where mechanistic interpretability is quietly accumulating an enormous, unpaid debt.
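To see how narrow the causal claim is, here is a minimal sketch of steering in a toy linear readout. Everything in it is a hypothetical stand-in: the `steer` helper, the random `W_out` readout matrix, and the `direction` vector are illustrative names, not Anthropic's code or anything from Claude's internals.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 64, 10

# Hypothetical toy pieces: a random readout matrix and a unit "feature" direction.
W_out = rng.normal(size=(vocab, d_model))
direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)

def steer(activation, direction, strength):
    """Add a scaled copy of `direction` to an activation vector."""
    return activation + strength * direction

def logits(activation):
    """Linear readout from activation space to token scores."""
    return W_out @ activation

activation = rng.normal(size=d_model)
base = logits(activation)
steered = logits(steer(activation, direction, strength=10.0))

# In this linear toy the shift in logits is exactly strength * (W_out @ direction):
# a causal link between the direction and the output, and nothing more than that.
shift = steered - base
favored_token = int(np.argmax(W_out @ direction))
print("token most boosted by steering:", int(np.argmax(shift)))
print("token the direction points at: ", favored_token)
```

The sketch shows that turning the direction up moves the output in a predictable way. It says nothing about whether the direction is monosemantic, stable across inputs, or the only direction that produces the same shift.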

The two things "interpretability" actually does

Strip the field to its bones and there are two distinct activities wearing one name.

The first is interpretation: producing a human-readable story about what some internal component does. This feature is "the Golden Gate Bridge." This circuit "does addition." This attention head "tracks the subject of the sentence." Interpretation is what the field mostly ships, and it ships beautifully: sparse autoencoders pull millions of candidate features out of a model by exploiting the fact that networks pack many concepts into overlapping directions (superposition), and circuit-tracing tools draw lovely diagrams of how those pieces connect. The output is legible, vivid, and abundant.

The second is verification: causal proof that the story is correct, that if you intervene on the component, the predicted behavior changes, and that the method which found the component actually finds true components when you test it against a known answer. Verification is what the field mostly lacks. It's harder, slower, less photogenic, and it has a nasty habit of demoting beautiful interpretations to "unproven."

Here's the trap, stated plainly: a discovered feature is a claim, not a finding. It's a hypothesis about the model's internal causes, dressed in the confidence of a measurement. And the field's success metrics (how many features you found, how much of the model's activation you can reconstruct, how plausible the story sounds) all live on the interpretation side. Almost none of them measure whether the story is true. The dominant worry in the field is what you might call the microscope problem: our tools aren't sharp enough yet, so build a better microscope and the model becomes legible. That's the wrong diagnosis. The deeper crisis isn't resolution. It's that we have no reliable way to tell a true interpretation from a merely plausible one, and we are generating plausible ones at industrial scale.

What happens when you actually check

The good news is that a handful of researchers have started checking, and the results are bracing.

The cleanest test is to build a synthetic model where you know the true features because you planted them, then run a sparse autoencoder and see how many it recovers. One 2024 evaluation did exactly this and found that the SAE recovered only about 9% of the true features while achieving roughly 71% explained variance. Sit with that gap. The tool reconstructs the model's activations almost three-quarters of the way, so it looks like it's working, while identifying barely one in eleven of the actual underlying features. High fidelity, low understanding. The reconstruction quality everyone quotes as a success metric is nearly silent on whether you found the right things.
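The gap between reconstruction and recovery is easy to reproduce in miniature. The sketch below plants known feature directions, then scores a decomposition on both explained variance and feature recovery. Note the loud assumption: it swaps in PCA as a stand-in for the sparse autoencoder, purely so it runs with NumPy alone, and the sizes, sparsity level, and 0.9 cosine threshold are all hypothetical choices. Its numbers will not match the 9% and 71% figures from the evaluation above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, d_model, k = 5000, 40, 20, 12

# Plant the ground truth: 40 unit-norm feature directions packed into 20 dims.
true_features = rng.normal(size=(n_features, d_model))
true_features /= np.linalg.norm(true_features, axis=1, keepdims=True)

# Each sample activates a sparse handful of features with positive strength.
active = rng.random((n_samples, n_features)) < 0.05
strengths = rng.uniform(0.5, 1.5, size=(n_samples, n_features))
X = (active * strengths) @ true_features
X -= X.mean(axis=0)

# Stand-in "interpretability method" (NOT an SAE): top-k principal directions.
_, _, vt = np.linalg.svd(X, full_matrices=False)
learned = vt[:k]                          # (k, d_model), orthonormal rows
X_hat = X @ learned.T @ learned           # reconstruction from k directions

explained_variance = 1.0 - np.sum((X - X_hat) ** 2) / np.sum(X ** 2)

# A planted feature counts as recovered if some learned direction nearly matches it.
cosines = np.abs(true_features @ learned.T)   # (n_features, k)
recovered = float(np.mean(cosines.max(axis=1) > 0.9))

print(f"explained variance:         {explained_variance:.2f}")
print(f"planted features recovered: {recovered:.2f}")
```

The design point is that the two numbers are computed from different things. Explained variance only needs the activations; recovery needs the planted answer key, which is exactly what a real model never gives you.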

It gets worse, in instructive ways. A 2025 paper on "interpretability illusions" showed that SAE features can be steered by the input: small adversarial perturbations to what you feed the model can move which "features" light up, meaning the feature you proudly read off can be an artifact of your prompt, not a stable property of the model. Work on the vision side found features that "look meaningful upon inspection while not actually representing what they appear to," because attention mixes information across an image, so a patch that co-occurs with a feature gets mistaken for what causes it. And there's now a paper with the wonderfully deflating title "Sanity Checks for Sparse Autoencoders," asking whether SAEs even beat random baselines on common interpretability metrics. The fact that this question needs asking, in 2026, is the whole story in miniature. Even ablation (knocking out a feature to see what breaks) keeps finding that behaviors are smeared across many features and that killing any single one often does almost nothing, which is the opposite of the clean "this feature does X" narrative.

The pattern underneath all of it: plausible is not causal, and reconstruction is not recovery. A method can explain the data and still be telling you a story about features that aren't there.

Neuroscience already ran this experiment, for decades

If this sounds familiar, it should, because there's an older field that has been reverse-engineering an inscrutable black-box network and asking "what does this unit represent?" since the 1960s. Neuroscience paid for these lessons in advance, and the receipts are worth reading.

The canonical one is the dead salmon. In 2009, a team led by Craig Bennett put a dead Atlantic salmon in an fMRI scanner, "showed" it photographs of humans in emotional situations, and ran the standard analysis pipeline of the day. The pipeline reported statistically significant brain activity in the dead fish. Not because the salmon was thinking (it was deceased), but because an fMRI image contains thousands of voxels, and if you run thousands of comparisons without correcting for it, some will cross the significance threshold by pure chance. The study (it won an Ig Nobel Prize) became neuroscience's permanent reminder that a plausible-looking signal can come from literally nothing.
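The arithmetic behind the salmon is worth running once. The sketch below is a simplified simulation, not the study's actual pipeline: it assumes a hypothetical 10,000 voxels of pure Gaussian noise with 20 scans each, uses a plain z-test with known variance, and compares an uncorrected threshold against a Bonferroni-corrected one.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)
n_voxels, n_scans, alpha = 10_000, 20, 0.05   # hypothetical sizes

# Pure noise: no voxel carries any signal, like the dead salmon.
data = rng.normal(size=(n_voxels, n_scans))

# z-statistic per voxel for "mean differs from zero" (noise variance is 1 by construction).
z = data.mean(axis=1) * np.sqrt(n_scans)

# Two-sided thresholds: per-test alpha versus alpha divided across all tests.
z_uncorrected = NormalDist().inv_cdf(1 - alpha / 2)
z_bonferroni = NormalDist().inv_cdf(1 - alpha / (2 * n_voxels))

hits_uncorrected = int(np.sum(np.abs(z) > z_uncorrected))
hits_bonferroni = int(np.sum(np.abs(z) > z_bonferroni))

print("expected false positives without correction:", alpha * n_voxels)
print("significant voxels, uncorrected:", hits_uncorrected)
print("significant voxels, Bonferroni: ", hits_bonferroni)
```

With a 5% threshold and 10,000 independent tests on noise, roughly 500 "active" voxels are expected by chance alone. Bonferroni is the bluntest available correction and is used here only because it fits in one line.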

Two more correctives rhyme directly with interpretability's present. Reverse inference (concluding "this brain region does cognitive function X" because "X lights up this region") is a known fallacy, because regions aren't tied to single functions; it's the exact backward reasoning of reading a feature's "meaning" off the inputs that happen to activate it. Then there is the grandmother cell debate: Quiroga and colleagues famously found a single human neuron that fired for many different photos of Jennifer Aniston but not for other celebrities, the "Jennifer Aniston neuron," the literal ancestor of the dream that one unit equals one clean concept. It's real. It also remains contested whether such tidy single-unit selectivity is the brain's actual scheme or an artifact of which neurons the electrode happened to catch. That unresolved fight is the monosemantic-feature question, twenty years early.

And the parallel is no longer just an analogy. A 2025 paper titled, inevitably, "The Dead Salmons of AI Interpretability," made it explicit, showing that feature attribution, probing, sparse autoencoders, and even causal analyses can produce confident, plausible-looking explanations for randomly initialized neural networks: untrained models that have learned nothing and should yield no features at all. An interpretability method that "explains" an untrained network is the dead salmon, exactly. The authors argue the field needs a statistical-inference posture it currently lacks: null models, sanity checks, multiple-comparisons discipline. The young field is rediscovering, the hard way, the discipline the old field already bought.

This is provenance, for the model's mind

Here's why this matters beyond the seminar room, and it's the part that should land for anyone who builds with these systems. Interpretability is, at bottom, provenance for the model's mind: an attempt to establish where a behavior actually came from. And the discipline that any serious operation applies to provenance applies here without modification: an unverified claim is worth nothing, no matter how authoritative it sounds. You wouldn't cite a confident encyclopedia entry you couldn't trace to a real source. An SAE that reconstructs 71% of the variance while recovering 9% of the true features is precisely that: a fluent, plausible source that's mostly confabulating. Treat it accordingly.

There's a sharper twist when the system doing the interpreting and the system being interpreted are the same kind of thing: an LLM asked to narrate its own internal features, a black box explaining a black box. Self-interpretation inherits every failure mode above and adds a new one: a model is extremely good at producing fluent, plausible explanations, including of itself, and being good at plausible is not the same as being faithful. We already see this one level up, in the finding that a model's chain-of-thought reasoning often isn't a faithful account of how it actually reached its answer. It is a confident narration that doesn't match the mechanism. Push that down to a model explaining its own weights and you have the same problem with higher stakes. The lesson generalizes into something close to a maxim: self-knowledge you can't verify is just self-narration. "Know thyself" isn't introspective storytelling; it's causal self-experiment against held-out ground truth.

The cure is cheap, and it isn't a bigger microscope

The reflex, when interpretations keep proving wrong, is to demand more resolution: bigger autoencoders, more features, the hundred-petabyte sweep. That's the microscope answer, and it's largely beside the point. The cure neuroscience landed on is humble and almost embarrassingly cheap: run your method on a system whose mechanism you already know, before you trust it on one you don't.

Concretely, there are three sanity checks, each costing a fraction of what it took to train the model in the first place. Build a toy model with features you planted yourself, and confirm your method recovers them; if it can't find the answer you wrote down, its findings on the real model are worthless. Run your method on a randomly initialized network, and confirm it correctly finds nothing; if it "explains" the dead salmon, the method is the problem, not the model. And compare against a random baseline on every metric you report, so "this beats noise" stops being an assumption and becomes a number. None of this requires a better microscope. It requires null models, multiple-comparisons discipline, and a willingness to let a beautiful interpretation die when it fails a sanity check.
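The third check, turning "this beats noise" into a number, can be sketched in a few lines. This is one possible shape for it under stated assumptions, not an established API: `random_baseline_p_value`, the `score` metric, and the toy activations with a planted direction are all hypothetical names and data made up for illustration.

```python
import numpy as np

def random_baseline_p_value(score_fn, candidate, n_random=1000, seed=0):
    """Fraction of random unit directions scoring at least as well as `candidate`."""
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(n_random, candidate.shape[0]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    null_scores = np.array([score_fn(d) for d in dirs])
    # Add-one smoothing so the estimate is never exactly zero.
    return (1 + np.sum(null_scores >= score_fn(candidate))) / (1 + n_random)

# Hypothetical toy data: activations in which one planted direction tracks a label.
rng = np.random.default_rng(1)
d_model, n = 32, 2000
planted = np.zeros(d_model)
planted[0] = 1.0
labels = rng.integers(0, 2, size=n)
acts = rng.normal(size=(n, d_model)) + 2.0 * labels[:, None] * planted

def score(direction):
    """Metric: absolute correlation between the projection and the label."""
    return abs(np.corrcoef(acts @ direction, labels)[0, 1])

p = random_baseline_p_value(score, planted)
print(f"score = {score(planted):.2f}, p-value vs random directions = {p:.4f}")
```

The same helper covers the second check if `acts` comes from a randomly initialized network: a method that is working should then produce candidates whose p-values look no better than chance. When many candidate features are tested this way, the p-values still need a multiple-comparisons correction.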

So the practical takeaway, and it reaches well past LLMs to any explanation of a system you can't see into (a recommendation engine, a fraud model, a "why did it do that" dashboard, a post-hoc rationalization of any kind): when someone hands you a clean, compelling story about the internals of a complex system, do not buy it because it's plausible. Plausible is the cheapest thing in the world to manufacture; a dead fish can clear that bar. Ask the two questions that separate a finding from a claim. Did you intervene: does changing this thing actually change the behavior you say it controls? And did you check your method against an answer you already knew: would it have found this in a system where the truth is nothing? A discovered feature that survives both is knowledge. One that survives neither is a story. The whole discipline, in the end, is refusing to confuse the two, and being willing to put a dead salmon in the scanner to keep yourself honest.

