A sociologist passed a gravitational-wave physics exam without doing any physics. We built that being for every field at once — and one test tells you which of its confident sentences to check.
In the mid-2000s, a sociologist named Harry Collins sat down to take a physics exam he had no business passing.
Collins is not a physicist. He has never detected a gravitational wave, solved the field equations, or aligned a laser interferometer. What he had done, since 1972, was hang around the people who do — attending their conferences, reading their papers, arguing with them over coffee for decades. And in a now-famous experiment he called an “Imitation Game,” he put that immersion to the test. A panel of real gravitational-wave physicists was asked a set of technical questions. So was Collins. Then expert judges read the answers and tried to spot the impostor.
They couldn't. In a rigorous 2016 write-up of the experiment (Collins et al., “An Imitation Game concerning gravitational wave physics,” arXiv:1607.07373), Collins answered eight technical questions posed by the Cardiff physicist B.S. Sathyaprakash. His marks were comparable to those of working gravitational-wave physicists and markedly better than those of other kinds of physicists, let alone other sociologists. A man who cannot do the physics produced answers indistinguishable from those of people who can.
If that result gives you a small jolt of recognition, it should. You have been talking to a machine that does the same thing, for every field at once, since at least late 2022.
The framework that makes sense of Collins's stunt comes from the book he wrote with Robert Evans, Rethinking Expertise (University of Chicago Press, 2007). Their core move is to split a word we usually treat as a single thing — “expertise” — into two.
Contributory expertise is the ability to make original, verified contributions to a field. It is what we usually mean by “an expert”: someone with the embodied, practiced skill to actually do the work and add to it. Crucially, Collins and Evans argue it can only be acquired by immersion in the practice — you learn it the way an apprentice learns a craft, through years of doing, failing, and being corrected by other practitioners.
Interactional expertise is fluency in the language of a field without the ability to practice it. It is the tacit knowledge you absorb by marinating in a community's conversation — enough to talk shop convincingly, ask sharp questions, follow the arguments, and broker between specialists — acquired, as Collins proved on himself, in the complete absence of contributory skill.
Here is the part that matters for what follows: interactional expertise is not a consolation prize. Collins and Evans's whole point is that it does real, valuable work. Interactional experts are the people who make complex collaboration possible. The science journalist who can interview three rival labs and synthesize them. The program manager who speaks fluent engineering, fluent design, and fluent legal, and keeps them aligned. The peer reviewer evaluating work slightly outside their own subspecialty. Interactional expertise is the connective tissue of modern knowledge work — a genuine and powerful thing to have.
It is also, almost exactly, a description of a large language model.
We are awash in bad metaphors for what these systems are. “Stochastic parrot” undersells them — a parrot cannot reformulate Kant in the voice of a stand-up comedian or walk you through a tricky merge conflict. “Artificial expert” oversells them, in a way that gets people sued. The Collins-Evans distinction gives us the metaphor that is actually true:
A large language model is the most powerful interactional expert ever built. It has near-maximal interactional expertise across every field of human knowledge simultaneously, and near-zero contributory expertise in any of them.
Everything an LLM is trained on is language about fields — the conversation of medicine, the conversation of tax law, the conversation of distributed systems — never the practice of them. It has read every shop-talk transcript and absorbed the tacit fluency, exactly as Collins absorbed the talk of gravitational-wave physicists. What it has never done is run the experiment, file the brief, or ship the system and watch it fall over in production. It learned the language without the practice. That is not a description of a flawed expert. It is the textbook definition of an interactional one.
Notice what this reframe buys you. It refuses both of the usual errors at once. The LLM is not a fake — interactional expertise is real expertise, and naming it that is genuine praise. But it is also not a contributory expert, and no amount of scaling turns one into the other, because contributory expertise is constituted by immersion in practice and external verification — precisely the things a language model structurally does not have. You cannot fine-tune your way across that gap any more than Collins could become a gravitational-wave physicist by reading more transcripts.
The label is a rare two-for-one: the highest available compliment and the exact location of the guardrail, in a single phrase.
To see where the guardrail goes, bring in a second thinker who has rarely been paired with Collins and Evans — and who fits like a missing puzzle piece.
In 2019 the philosopher Nathan Ballantyne published “Epistemic Trespassing” in the journal Mind (128:510). His target is a specific, common intellectual sin: “thinkers who have competence or expertise to make good judgments in one field, but move to another field where they lack competence — and pass judgment nevertheless.” His rogues' gallery is familiar: the Nobel chemist Linus Pauling evangelizing megadose vitamin C, brilliant physicists holding forth on theology, the tech founder certain that conquering one industry qualifies him to settle a geopolitical crisis. The competence is real. It is just pointed at the wrong target.
Ballantyne's sharpest observation is about reliability. When a trespasser happens to land on the truth, he writes, it is “thanks to a stroke of good luck... not because they reliably responded to the evidence.” A trespasser who gets it right is a broken clock: correct, but not to be trusted, because the correctness was not produced by the thing that makes correctness reliable.
Now hold the two ideas side by side. Collins and Evans tell you what kind of expertise produces confident fluency without grounded competence: interactional expertise running ahead of contributory expertise. Ballantyne tells you what goes wrong when someone operates in exactly that mode: epistemic trespassing. They are two halves of one sentence. Interactional expertise minus contributory expertise is the structural definition of a trespasser.
Which means an LLM is not a tool that occasionally trespasses. It is a trespassing machine by construction — a system whose entire competence is interactional, deployed across every field where it has no contributory standing at all. The hallucination problem is not a bug bolted onto an otherwise sound expert. It is what epistemic trespassing looks like when you industrialize it.
This reframing makes a confusing pile of evidence suddenly legible.
Start with the mechanism. In 2025, OpenAI published a paper bluntly titled “Why Language Models Hallucinate,” and its answer was not “we need more data.” It was that standard training and evaluation reward confident guessing over admitting uncertainty — a model that says “I don't know” scores worse on the benchmarks than one that produces a plausible guess, so the optimization process manufactures confident fluency. Read through the Collins lens, that is not an embarrassing admission. It is a precise statement that the system is optimized for the interactional virtue (sounding right) and not the contributory one (being verifiably right). Fluency was the target. Contribution was never measured.
Then look at what that produces where the stakes are high enough to count carefully. In “Large Legal Fictions” (Dahl, Magesh, Suzgun, and Ho, Journal of Legal Analysis 16:1, 2024), Stanford researchers asked leading models specific, verifiable questions about random federal court cases. Hallucination rates ranged from 58% with GPT-4 to 88% with Llama 2. These are not trick questions; they are the model doing exactly what an interactional expert does when pushed past the language into the practice: generating fluent, structurally perfect legal prose that happens to describe cases that do not exist.
The fluency-over-accuracy pattern is not a flaw to be patched out. It is the fingerprint of the expertise type. A being with pure interactional expertise will always sound exactly as confident describing the things it knows as the things it is inventing, because confidence is a property of the language, and the language is all it has. That is the unsettling lesson of the Imitation Game, restated: pure interactional expertise is designed to be indistinguishable from the real thing. You will not catch it by listening harder.
So you stop trying to catch it by listening, and you start asking a different question. Not “is this impressive?” — fluency makes everything impressive. Not “does it sound like an expert wrote it?” — it always will. The question is:
Does this claim require external verification to be true?
That single test cleaves cleanly along the interactional/contributory line, and it tells you what to trust the model with.
Trust the interactional work, because interactional work is what these systems are genuinely, world-changingly good at. Translation. Summarization. Explaining a dense paper in plain language. Reframing an idea for a different audience. Brokering between specialists who do not speak each other's dialects. Drafting the first version of nearly anything. These are not the booby prizes of AI; they are the highest-value activities of the best human interactional experts, now available on tap. A translation is true on its face; an explanation either clarifies or it doesn't, and you can tell. The risk is low because the work is the language.
Gate the contributory claims behind external verification, every time, no exceptions for how confident the output sounds. Original factual assertions. Specific numbers, names, dates. Citations and case law — the exact place where trespassing literalizes into invented federal cases. Anything the model presents as a result rather than a rephrasing. For these, the model's fluency is not evidence; it is noise that looks like evidence.
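As a rough illustration of that gate, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the regex heuristics, the names `Claim`, `gate`, and `usable`, and the `verify` callback, which stands in for whatever external check you actually have (a case-law database, the source document, a human reviewer). Heuristics this crude will miss some claims and flag some harmless sentences. The point is only the shape: a flagged sentence does not pass until something outside the model confirms it.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical heuristics: surface patterns that often mark a checkable claim.
CONTRIBUTORY_PATTERNS = {
    "number": re.compile(r"\b\d+(?:\.\d+)?%?"),
    "year": re.compile(r"\b(?:19|20)\d{2}\b"),
    "case_citation": re.compile(r"\b[A-Z][\w.]* v\. [A-Z][\w.]*"),
}


@dataclass
class Claim:
    sentence: str
    reasons: list[str]               # which patterns fired; empty means "looks interactional"
    verified: Optional[bool] = None  # None means no check was needed or run


def gate(sentences: list[str], verify: Callable[[str], bool]) -> list[Claim]:
    """Flag sentences that look contributory and run the external check on each."""
    claims = []
    for sentence in sentences:
        reasons = [
            name
            for name, pattern in CONTRIBUTORY_PATTERNS.items()
            if pattern.search(sentence)
        ]
        # Only flagged sentences are sent to the verifier.
        verified = verify(sentence) if reasons else None
        claims.append(Claim(sentence, reasons, verified))
    return claims


def usable(claims: list[Claim]) -> list[str]:
    """Keep unflagged sentences plus flagged ones that passed verification."""
    return [c.sentence for c in claims if not c.reasons or c.verified]


if __name__ == "__main__":
    draft = [
        "This paragraph restates the argument in plain language.",
        "The court decided Smith v. Jones in 1987.",  # made-up example citation
    ]
    # A verifier that confirms nothing: every flagged sentence is held back.
    for claim in gate(draft, verify=lambda s: False):
        print(claim)
```

Note that the model's own confidence never appears as an input. The only way a flagged sentence gets through is the `verify` callback.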
The boundary case that makes the rule precise — and the one developers will push on — is code.
Code feels like the strongest counterexample. It is produced fluently, it is unmistakably doing something, and a model that writes a working function sure looks like it is making an original contribution. So is code interactional or contributory?
It is contributory — and the tell is that its truth is external. A function is not correct because it reads well or because the model is sure of it. It is correct if it compiles, runs, and passes its tests. That is verification by immersion in the practice, exactly Collins and Evans's criterion, enforced by a machine. Which means the rule is not the crude “trust LLMs for words but not for code.” It is sharper and more useful: trust the interactional act — drafting the code, explaining an unfamiliar API, translating a function from Python to Rust — and verify the contributory claim, which is the claim that the code actually works. The compiler is your external verifier. The test suite is your external verifier. The model's confidence is never your external verifier, because confidence is the one thing a pure interactional expert produces in unlimited supply regardless of whether it is right.
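In practice, "the test suite is your external verifier" can be as small as the following Python sketch. The function name `passes_tests` and the example `slugify` candidates are hypothetical; the sketch assumes the model's output is a Python snippet and that you have written the assertions yourself. It runs the candidate in a separate interpreter process, which keeps a crash out of your own process but is not a security sandbox, so untrusted code still needs real isolation.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def passes_tests(candidate_source: str, test_source: str, timeout_s: float = 10.0) -> bool:
    """Return True only if the candidate runs and every assertion in test_source holds."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate_check.py"
        # Candidate first, then the assertions that exercise it.
        script.write_text(candidate_source + "\n\n" + test_source, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # Code that never finishes counts as a failure.
            return False
    # A syntax error, exception, or failed assertion gives a nonzero exit code.
    return result.returncode == 0


if __name__ == "__main__":
    # Two fluent-looking candidates; only the tests can tell them apart.
    good = "def slugify(title):\n    return '-'.join(title.lower().split())\n"
    bad = "def slugify(title):\n    return title.lower()\n"
    tests = "assert slugify('Hello World') == 'hello-world'\n"
    print("good candidate accepted:", passes_tests(good, tests))
    print("bad candidate accepted:", passes_tests(bad, tests))
```

Both candidates read equally well. The verdict comes from the exit code of the run, not from how either one reads.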
This is why the engineers who get the most from these tools are not the ones who trust them most or least, but the ones who have internalized which half of the interaction they are in — and wired up a verifier for the contributory half so reflexively they barely notice.
If the model holds the interactional expertise, what is left for the human?
Collins and Evans, helpfully, already answered this. Their Periodic Table of Expertises includes a category called meta-expertise: the ability to judge between experts without being one yourself. It is the skill that lets a layperson tell a real doctor from a confident quack, or a working scientist from a tobacco-funded one, without personally being able to do the medicine or the science. Meta-expertise is how non-experts navigate a world so large that no one can be a contributory expert in more than a sliver of it.
That is the human job around an LLM, and it is the one job the model cannot take, because it is the job of judging the model. The valuable person in the room is no longer the one who knows the most facts — the fluent interactional expert in the corner has all of those and will recite them on request. The valuable person is the one with the meta-expertise to know which fluent claim to trust: which output is interactional and safe to use as-is, which is a contributory claim wearing interactional clothes and must be checked, and how to run the check. That is a learnable skill, and it is rapidly becoming the central one.
There is an honest caveat, and Ballantyne would insist on it. Trespassing, he argues, is only prima facie problematic: a burden of justification, not an absolute ban, and the burden scales with how far you have strayed. By the same logic, an LLM's interactional reach is a real gift, not a sin: it lowers the cost of stepping into an unfamiliar literature, getting oriented, asking better questions. The rule does not dismiss the tool. It gates the model's contributory claims while setting its interactional powers loose. That is the opposite of fear: it is knowing precisely what you are holding.
And the discipline cuts both ways, which is how you know it is analysis and not just caution. The same test that tells you to verify a model's confident citation also tells you to stop double-checking its translations and summaries out of vague unease — those are interactional work, the thing it is actually built for, and treating every output as suspect wastes the genuine gift. The rule deletes unnecessary distrust as readily as it installs necessary distrust.
Two decades ago, a sociologist proved that a being could absorb the entire language of a science without ever doing the science — and fool the scientists. We have now built that being, pointed it at every field at once, and put it in everyone's pocket. The mistake is to ask it to be something it structurally cannot be: a contributor. The opportunity is to use it as exactly what it is — the most fluent interactional expert in history — and to become, ourselves, the meta-experts who know which of its fluent claims to trust. The goal was never a machine that knows everything. It is a human who knows which confident sentence to check.
The rule is “gate the contributory claims behind external verification.” That's a thing you can build.
An agent's confident output is interactional by default — it sounds exactly as sure when it's inventing a citation as when it's summarizing one. The fix isn't to trust it less; it's to wire a verifier onto the contributory half so reflexively you barely notice. The Agent Trust Stack is that verifier for autonomous agents: provenance for what an agent actually did, reputation for how reliably it has done it, and verification that turns “sounds right” into “checked right” — so a fluent claim has to clear an external gate before you act on it.