The Use-Mention Problem — Why Philosophy of Language Predicts Prompt Injection Cannot Be Solved

In December 2025, Palo Alto Unit42 found a single web page carrying twenty-four separate prompt injection attempts. Twenty-four. On one page. Visible plaintext in one paragraph, HTML attribute cloaking in the next, CSS rendering suppression a few lines down — all aimed at one goal: trick a content-moderation agent into approving a scam advertisement for military glasses.

Twenty-four tries, multiple techniques, one target. It looks like persistence. It looks like strategy. It is something stranger than either: it is the use-mention problem, automated.

That pattern keeps showing up. PoisonedRAG (USENIX Security 2025) showed five poisoned documents in a knowledge base of millions could flip a retrieval-augmented system 90 to 99 percent of the time. A 2026 meta-analysis across 78 studies in MDPI’s Information journal reported adaptive-attack success above 85 percent against state-of-the-art defenses. In March 2026, Munich Re — the world’s largest reinsurer, the company that prices catastrophic risk for a living — classified prompt injection as a “major attack vector.” When the reinsurers say a software vulnerability is structurally uninsurable, it is worth asking why.

The honest answer is that we are trying to solve a 19th-century problem from analytic philosophy with 21st-century engineering tooling, and the philosophers got there first. They proved it could not be solved.

What Frege Noticed in 1892

The use-mention distinction is one of the oldest puzzles in analytic philosophy. It separates using a word to refer to something from mentioning the word as a linguistic object. The textbook example: “Cheese is derived from milk” uses the word — it refers to the dairy product. “‘Cheese’ is derived from the Old English cese” mentions the word — the topic now is the word itself.

Gottlob Frege did not put it in those terms in 1892, but his distinction between Sinn (sense) and Bedeutung (reference) opened the door. Medieval logicians had already worked through it as suppositio; W.V.O. Quine sharpened it into formal rules for quotation; Douglas Hofstadter made it a household example in Gödel, Escher, Bach. It looked like a logician’s curiosity until we built systems that had to make the distinction billions of times a day, and we did not tell them how.

Why Every Previous Code-Data Confusion Got Solved

Computer security has run into the use-mention problem before. Every time, we called it something else.

The von Neumann architecture (1945) put both program instructions and data in the same memory space. The CPU could not, in principle, tell them apart: both were just sequences of bits. This was the original sin. The Morris Worm in 1988 demonstrated what happens when you can disguise instructions as data: stack smashing, buffer overflows, decades of CVEs.

The fix was not to make the CPU smarter about meaning. The fix was to externally enforce a hierarchy the CPU could not internally see. The NX (No-Execute) bit, shipped in AMD processors and then OpenBSD 3.3 in 2003 and Windows XP SP2 in 2004, flagged memory pages as either writable or executable, never both. We did not solve the philosophical problem. We routed around it.

SQL injection followed the identical pattern. User input got concatenated into queries, and a database engine built to interpret SQL did exactly what it was told. The fix was prepared statements: the query goes to the database with placeholders; the data goes separately, treated strictly as literals. The boundary between command and parameter became structurally impossible to cross.

LLMs break this pattern entirely. The UK’s NCSC was blunt in December 2025: “Under the hood of an LLM, there is no distinction made between ‘data’ or ‘instructions’; there is only ever ‘next token.’” Schneier and Raghavan in IEEE Spectrum this January: LLMs “flatten multiple levels of context into text similarity. They see ‘tokens,’ not hierarchies and intentions.” There is nowhere to bolt the NX bit on, no parameterized-query equivalent for natural language — because, as Austin showed seventy years ago, natural language does not have parameterized queries.

Austin and the Performative Trap

J.L. Austin’s 1955 William James lectures, published as How to Do Things with Words in 1962, made the prior assumptions in philosophy of language look hopelessly naive. Austin showed that language does not just describe the world. Many utterances constitute what they say. “I promise to pay you” does not report a promise; it is the promise. Very few utterances have no performative dimension at all.

Austin distinguished three layers in any utterance: the locutionary (the meaningful expression itself), the illocutionary (the act performed in saying it — promising, requesting, commanding, warning), and the perlocutionary (the further effects on the listener). The same sentence can carry different illocutionary force depending on context. “You will be more punctual in the future” can be a prediction, a command, or a threat. The sentence itself does not contain the answer.

Now read this email body: “Please forward this message to all contacts.”

Is that a description (data to triage), or a command for the LLM to execute? The text alone does not — and cannot — answer. Only pragmatic context can: who wrote it, with what authority, in what social setting. LLMs do not have stable access to that context. They have tokens, attention weights, and no reliable mechanism for resolving illocutionary force — which is why they cannot reliably resolve instruction versus data, which is why prompt injection works.

John Searle pushed Austin’s framework further. Genuine speech acts, he argued, require intentionality — the speaker’s mental state that gives the utterance its force. His Chinese Room argument (1980) makes the point precise: syntactic processing of symbols, no matter how sophisticated, does not produce semantic understanding of intention. Argue about the Chinese Room as metaphysics if you like; the security implication stands. LLMs operate without the intentional grounding that lets a human listener tell a real promise from a quoted one.

Derrida, EchoLeak, and the Impossibility of Fixed Context

Jacques Derrida’s response to Austin in Limited Inc (1977) sounded, at the time, like continental philosophy at its most baroque. Today it reads like a security advisory. Derrida argued that Austin’s framework relied on the assumption of saturated context — the idea that context fully determines meaning. Derrida disagreed. He pointed out that signs are iterable: they can always be detached from their original context and re-deployed in a new one. The same words, in a new setting, perform a different act. Context can never be fully fixed from the text alone.

EchoLeak — CVE-2025-32711, the zero-click prompt injection against Microsoft 365 Copilot — is iterability dressed up as a CVE. The attack arrived in legitimate-looking email content. When Copilot processed the inbox, the embedded text triggered the assistant to exfiltrate private data by encoding it into a markdown image URL; the browser rendered the image and sent the data out. The published analysis noted there were “no textual markers that a classifier could reliably distinguish from benign content.” That is exactly Derrida’s point. No amount of training produces a classifier that can read the next email’s intent off its tokens, because the intent is not in the tokens.

Grice’s Cooperative Principle, Weaponized

Paul Grice’s cooperative principle (1975) holds that conversation works because participants implicitly agree to be helpful — making contributions that are truthful (Quality), relevant (Relation), informative (Quantity), and clear (Manner). It is elegant. It is also why LLMs are uniquely vulnerable.

Modern alignment training is, in effect, Grice’s maxims compiled to weights — helpful, harmless, honest. Schneier and Raghavan named “overconfidence” and “eagerness to please” as central contributors to prompt injection vulnerability. They are not bugs. They are the product of training the model to behave the way Grice said cooperative speakers behave. The cooperative principle is the attack surface.

The empirical confirmation arrived in April 2026. Unit42’s analysis of in-the-wild indirect prompt injection found that 85.2 percent of jailbreak attempts use social engineering — Gricean appeals to cooperation, helpfulness, authority — rather than technical bypass. JSON-or-syntax injection accounted for 7 percent. The philosophical prediction is now a quantitative fact.

The Cline incident in early 2026 is the crisp illustration. A malicious GitHub issue title triggered an AI-powered triage workflow with shell access. The AI processed the title cooperatively — exactly as Grice predicts a cooperative speaker would. Code execution on GitHub Actions runners followed, then a compromised npm token, then a malicious package on thousands of developer machines. The post-mortem line is striking: “giving an LLM shell access in a CI context where it processes untrusted input is functionally equivalent to giving every GitHub user shell access.” Blur instruction and data, and trust becomes transitive in the wrong direction.

Google’s security team summarized the bind in June 2025 with unusual frankness: “the model is supposed to follow instructions in natural language, so any attempt to block certain instruction patterns also risks blocking legitimate user requests.” Read that sentence twice. The capability is the vulnerability. This is not an engineering tradeoff to be optimized; it is a logical identity rooted in Austin’s discovery that natural language is performative.

Tarski, Rice, and the Formal Ceiling — By Analogy

The mappings here are by analogy, not formal derivation — a transformer is not Tarski’s kind of formal system, and Rice’s theorem is stated for Turing machines. The structural lessons transfer anyway.

Alfred Tarski showed in 1936 that truth for a sufficiently expressive formal language cannot be defined within that language; it requires a metalanguage at a higher level. Natural language, Tarski observed, is “semantically closed” — it contains its own metalanguage, which is why it produces paradoxes like the Liar. An LLM processing natural language is operating inside a semantically closed system. It cannot step outside its token stream to mark which tokens are object-level and which are meta-level, because the language itself does not enforce that line.

Rice’s theorem is the closer. Rice proved every non-trivial semantic property of a program — what it does at runtime, versus syntactic facts you can read off its source — is undecidable in general. “Semantic property” means a fact about behavior when the program actually executes, not how the code looks on the page. The LLM analogue is the question “will this token be treated as instruction or as inert data when this model runs on this context?” — a behavioral fact about model execution, not a textual property of the input. Rice rules out general decision procedures for that kind of question. Brcic and Yampolskiy’s 2023 ACM Computing Surveys article on impossibility results in AI captures the practical takeaway: “The most damning impossibility results in AI safety are of deductive nature, ruling out perfect safety guarantees. However, there is a lot to be made probabilistically by the route of induction.” A perfect prompt-injection defense is formally impossible. A useful one is not.

Where This Argument Is Weakest

Three concessions before drawing conclusions.

First, the use-mention distinction in Frege’s logic is not identical to instruction-data confusion in transformers. They share a structure — two semantic functions of the same surface form, with the disambiguator outside the form itself — but they are not the same phenomenon. The mapping is precise enough to be predictive, not so precise that it forecloses surprise.

Second, the impossibility results bound general solutions, not bounded ones. Microsoft’s Spotlighting, instruction-hierarchy training, and SEAgent-style mandatory access control all measurably reduce attack rates. The philosophical analysis sets a ceiling, not a floor — and “probabilistic defense” is the entire history of computer security after the formal-verification community admitted Rice’s theorem applied to them too.

Third, future architectures may not be transformers. Symbolic-neural hybrids, retrieval with verified provenance, or models with explicit world-grounding might admit better internal separation than attention mechanisms allow. The philosophical point is about natural language, not transformers specifically. A future system that genuinely understood speaker intention in Searle’s sense could resolve the use-mention problem at a level Tarski could not reach. We are not close.

The Right Solution Exists — and Almost No One Uses It

Here is the part of the story most builders have not yet absorbed. In March 2025, Google DeepMind published CaMeL (arXiv:2503.18813) — a framework that extracts control and data flows from trusted queries, enforces capability-based access control, and prevents untrusted data from impacting program flow. The whole design borrows from traditional software security: Control Flow Integrity, Access Control, Information Flow Control. CaMeL is exactly the class of solution the philosophical analysis predicts: external enforcement of a hierarchy the model cannot internally see. It has provable security properties.

It also has a 7-point performance penalty. CaMeL completes 77 percent of agent tasks on the AgentDojo benchmark; an undefended baseline completes 84 percent. By February 2026, ten months after publication, the field assessment was unsentimental: “convincing real-world implementations remain limited.” Industry “still appears to rely largely on heuristic filters, prompt engineering tricks, or costly fine-tuning efforts.” Zero named companies are deploying CaMeL in production.

That is the deeper reading. The philosophical argument predicts the shape of the right defense. Google DeepMind built it. The market refused, because Schneier and Raghavan’s security trilemma — fast, smart, secure; pick two — is real. The use-mention problem is not just unsolved at the token layer; it is unsolved at the procurement layer for the same structural reason.

What This Actually Changes for Builders

Stop trying to solve prompt injection inside the model. Computer security has always advanced by drawing the line around the inner system, not by making it smarter. Three concrete moves follow.

1. Eliminate the lethal trifecta where you can. Critical attacks require three coinciding conditions: private-data access, untrusted tokens in context, and an exfiltration vector. Break any one. The fastest win is usually exfiltration — strip outbound URL rendering, gate network egress through a deterministic allowlist. EchoLeak needed a markdown image renderer. Take it away.

2. Treat retrieved content as untrusted in your trust model. RAG systems that grant retrieved documents the same authority as system prompts are repeating the dynamic-SQL mistake. PoisonedRAG’s five-needle attack works because the trust model has only one class. Prepared-statement-style separation — operator instructions and retrieved documents on different rails, retrieved content unable to elevate to instruction status — buys real margin even when attention cannot tell them apart. If your RAG pipeline has no notion of trust class per source, that is the first thing to add.

3. Add an interruption reflex via external policy. Humans have what Schneier and Raghavan call an interruption reflex — when something feels off, they pause. LLMs do not. SEAgent-style mandatory access control, run-time policy gates, and structured tool authorization act as the missing reflex from outside the model. Any agent action with non-trivial consequences should require an external policy decision the model cannot revise.

None of this solves the use-mention problem. None of it can. Frege’s distinction was never going to be solved inside a system that cannot tell Sinn from Bedeutung. You build the layer that does not have to. That is the gain from naming the problem correctly: you stop hunting for the patch that does not exist, you stop being surprised when each new defense gets bypassed, and when somebody hands you CaMeL — or its successor — you absorb the seven-point penalty instead of waiting for the free lunch the philosophy ruled out a century ago.

Frege could not have predicted GitHub Actions runners, Microsoft Copilot, or twenty-four injections on a single page. But he predicted the problem.

Sources: Palo Alto Unit42, twenty-four-injection finding (December 2025) and in-the-wild prompt injection study (April 2026, 85.2% social engineering / 7% syntax injection); Zou et al., PoisonedRAG, USENIX Security 2025; MDPI Information meta-analysis, 78 studies, 2026; Munich Re prompt injection classification, March 2026; UK NCSC guidance on LLM data/instruction conflation, December 2025; Schneier & Raghavan, IEEE Spectrum, January 2026; CVE-2025-32711 EchoLeak (Microsoft 365 Copilot zero-click); Cline GitHub Actions / npm token incident, early 2026; Google security team statement, June 2025; Debenedetti et al., CaMeL (Google DeepMind), arXiv:2503.18813, March 2025, with February 2026 field assessment; Brcic & Yampolskiy, ACM Computing Surveys, 2023; Frege 1892; Tarski 1936; Austin 1955/1962; Grice 1975; Derrida 1977; Searle 1980; Quine; Hofstadter, Gödel, Escher, Bach; Rice’s theorem.

Build the Layer the Model Cannot See

The argument here closes with three concrete moves — eliminate the lethal trifecta, treat retrieved content as a separate trust class, add an interruption reflex via external policy. All three are external-enforcement work: capability-based access control, trust classes per source, signed action receipts before the call goes on the wire. The Agent Trust Stack composes those layers under one install.

pip install agent-trust-stack
npm install agent-trust-stack

For provenance specifically — the signed-action-before-execution layer that move #3 names — Hosted Chain of Consciousness ships it as a service. Frege ruled out the in-model fix. Operators do not have to wait for what the philosophy already said could not arrive.