Why the most capable language model in history can decode your sarcasm but can’t tell when you’d rather it shut up.
In 2024, a team of cognitive scientists tested GPT-4 against 1,907 human participants on five theory-of-mind tasks. The model outperformed human adults at detecting irony. It exceeded them at parsing indirect hints. It beat them at interpreting strange, ambiguous stories. Then it was asked to recognize when a character in a story had said something true that another character didn’t want to hear — a faux pas — and it fell significantly below the human baseline (Strachan et al., Nature Human Behaviour, 2024).
This is not a story about AI being bad at social cognition. It’s worse than that. It’s a story about a specific, measurable, dissociated failure: the most capable language model in the world can decode your sarcasm and parse your hints, but it cannot tell when you’d rather it shut up.
The study — led by James Strachan and Cristina Becchio — gave GPT-4, GPT-3.5, and LLaMA2-70B the same battery of tests administered to nearly two thousand human participants. The five tasks came from decades of developmental-psychology research: false-belief attribution, indirect requests (hinting), irony recognition, “strange stories” requiring inference of intention, and faux-pas detection.
GPT-4’s results on the first four were striking. On irony, it scored significantly above the human mean. On hinting, the same. On strange stories, even higher (P = 1.04 × 10⁻⁵). These are not marginal wins. The model was genuinely, measurably better than human adults at the kind of social inference we think of as distinctly human.
Then came the faux-pas test. Fifteen stories, each containing a moment where someone says something true but socially inappropriate — mentioning a friend’s failed project at a dinner party, or commenting on the food without realizing the host cooked it. Four questions per story, including the critical measure: “Did the speaker know the relevant context?”
GPT-4 dropped significantly below the human baseline (P = 5.42 × 10⁻⁵, effect size r = 0.55). GPT-3.5 collapsed to near floor. And LLaMA2-70B appeared to ace it with near-perfect scores — until follow-up testing revealed it was simply defaulting to “they didn’t know” for every story, achieving ceiling scores through response bias rather than comprehension. It couldn’t discriminate between faux-pas stories and neutral ones (P = 0.180 on the critical discrimination test).
The dissociation is clean: high performance on belief inference, low performance on knowing when to keep the inference to yourself.
The developmental psychologist Simon Baron-Cohen defined a faux pas in 1999 as a situation where “a speaker says something without considering if it is something that the listener might not want to hear or know, and which typically has negative consequences that the speaker never intended” (Baron-Cohen et al., Journal of Autism and Developmental Disorders, 1999).
That definition does quiet, important work. Recognizing a faux pas requires holding two mental states simultaneously: the speaker’s ignorance (they didn’t know the relevant fact) and the listener’s emotional reaction (they’re hurt, embarrassed, or uncomfortable). False-belief tasks require only the first. Irony requires only interpretation. Hinting requires only inference. Faux pas requires all of these plus a judgment call: given what the speaker didn’t know and what the listener feels, was this the wrong thing to say?
This maps onto the developmental timeline. Wellman, Cross, and Watson’s 2001 meta-analysis of 178 false-belief studies established that children develop false-belief understanding by age five or six. But Baron-Cohen’s work showed that faux-pas detection doesn’t emerge until age nine to eleven — three to five years later, despite requiring no additional logical capability. What fills the gap isn’t smarter reasoning. It’s years of accumulated social experience: learning which specific people find which specific topics painful, building person-by-person models of what not to say.
LLMs are stuck in the gap. They have the six-year-old’s belief-tracking ability. They lack the eleven-year-old’s accumulated social map.
If you’ve deployed an agent that interacts with users, you’ve seen this failure. It never looks like a reasoning error. It looks like a lack of tact.
A customer contacts support about a new purchase, and the chatbot helpfully references their three previous returns. The information is accurate. The customer feels surveilled. A mental health chatbot has solid clinical information about therapy options but surfaces it without modeling that a specific user’s identity context might make a particular recommendation feel invalidating rather than helpful — a pattern Harvard SEAS researchers identified across popular AI mental health tools. A McDonald’s AI drive-thru assistant kept adding Chicken McNuggets to an order — reportedly 260 of them — as the customer repeatedly told it to stop, hearing the words but unable to model the escalating frustration behind them.
Every one of these follows Baron-Cohen’s structure: the system says something true without considering whether it is something the user might not want to hear, with negative consequences the system never intended.
This is not hallucination. It’s not a factual error. It’s not a capability gap. It’s a faux pas — saying something true at exactly the wrong moment.
Here’s the finding from the Strachan study that should reshape how you think about this problem. When GPT-4 was given a different version of the faux-pas test — one that asked it to rate the likelihood that the speaker didn’t know the relevant context, rather than give a binary yes/no answer — it performed well. It correctly discriminated between faux-pas, neutral, and knowledge-implied story variants (χ²(2) = 109, P = 1.54 × 10⁻²³).
The model can compute the inference. It just won’t commit to it.
Strachan and colleagues explain that safety guardrails designed to keep the model factual and prevent hallucination may simultaneously prevent it from opining on whether a story character inadvertently insulted someone. The mechanisms that suppress speculation suppress all speculation — including the kind that faux-pas detection requires: someone’s feelings might be hurt here.
This also reveals faux pas as the mirror image of a more famous failure mode: sycophancy, the tendency of language models to tell users what they want to hear. Sharma et al.’s 2023 study showed that five frontier models consistently matched user beliefs over truth across four free-form tasks — agreeing with incorrect claims, admitting nonexistent mistakes, even reproducing a user’s own arithmetic errors in their reasoning (Towards Understanding Sycophancy in Language Models, arXiv:2310.13548, ICLR 2024). Their analysis showed the cause is structural: human preference data systematically rewards belief-matching, and preference models trained on that data inherit and amplify the bias.
Sycophancy is saying what the user wants to hear when you shouldn’t. Faux pas is saying what the user doesn’t want to hear when you shouldn’t. Same root cause — no stored model of user preferences consulted at output time — opposite failure direction. A sycophantic agent that tells a struggling founder “your metrics look promising” and a tactless agent that brings up last quarter’s failed launch in front of investors are both missing the same architectural component. They differ only in which direction the missing preference signal would have pushed.
If the reasoning capability is already present and the problem is deployment rather than computation, then more training data and larger models won’t solve it. What’s needed is a mechanism.
Current preference-memory research points in the right direction but misses the critical half. Systems like PAMU (Sun et al., arXiv:2510.09720, 2025) fuse sliding-window and exponential moving averages to track user preferences over time. A-Mem (arXiv:2502.12110, 2025) handles dynamic memory management with selective retention. MemOS (2025) proposes layered memory with conflict-resolution logic. All of these model positive preferences — what the user likes, wants, needs.
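The fusion PAMU describes can be sketched in a few lines. This is an illustrative reconstruction, not PAMU’s actual implementation: a sliding-window average captures recent behavior, an exponential moving average captures long-term drift, and a blend weight combines them into one positive-preference score per topic (all parameter names here are assumptions).

```python
from collections import deque

class FusedPreferenceTracker:
    """Sketch of sliding-window + EMA preference fusion (illustrative,
    not PAMU's code). Tracks how strongly a user engages with each topic."""

    def __init__(self, window=5, alpha=0.2, blend=0.5):
        self.window = window   # sliding-window length (recent behavior)
        self.alpha = alpha     # EMA smoothing factor (long-term drift)
        self.blend = blend     # weight on the sliding-window term
        self.recent = {}       # topic -> deque of recent signals
        self.ema = {}          # topic -> long-term EMA

    def observe(self, topic, signal):
        """Record one engagement signal in [0, 1] for a topic."""
        buf = self.recent.setdefault(topic, deque(maxlen=self.window))
        buf.append(signal)
        prev = self.ema.get(topic, signal)
        self.ema[topic] = self.alpha * signal + (1 - self.alpha) * prev

    def score(self, topic):
        """Fused preference score; 0.0 for a topic never observed."""
        buf = self.recent.get(topic)
        if not buf:
            return 0.0
        window_avg = sum(buf) / len(buf)
        return self.blend * window_avg + (1 - self.blend) * self.ema[topic]
```

Note what the structure can and cannot express: every signal is an observation of engagement. There is no slot for “this user does not want this surfaced.”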
None of them model negative preferences — what the user doesn’t want surfaced. That’s exactly the gap the faux-pas failure maps onto.
The asymmetry is telling. Positive preference memory is easy to build because users signal preferences constantly through their behavior — asking about certain topics, clicking on certain results, returning to certain tools. Negative preference memory is harder because avoidance is invisible. A user who never mentions a topic could be avoiding it deliberately or could simply have no need for it. The signal you most need is the signal users are least likely to give you.
The missing architectural component is what you might call an irritation memory: a stored, queryable record of what this specific user finds aversive, sensitive, or unwanted, consulted before output generation. Not “user prefers Python over JavaScript” but “user flagged Source X as unreliable last week — don’t cite it.” Not “user likes concise answers” but “user mentioned a layoff — don’t ask about work.”
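A minimal sketch of such a store, under stated assumptions: entries, matching strategy, and field names are all hypothetical, and the substring match stands in for whatever semantic matching a real system would use.

```python
import time
from dataclasses import dataclass, field

@dataclass
class Aversion:
    topic: str      # what not to surface, e.g. "previous returns"
    reason: str     # provenance, e.g. "user objected in session 14"
    recorded_at: float = field(default_factory=time.time)

class IrritationMemory:
    """Hypothetical 'irritation memory': a queryable record of what a
    specific user does NOT want surfaced, checked before output."""

    def __init__(self):
        self._entries = []

    def record(self, topic, reason):
        self._entries.append(Aversion(topic, reason))

    def flagged(self, draft):
        """Return every stored aversion the draft response touches.
        Naive substring match stands in for real semantic matching."""
        text = draft.lower()
        return [a for a in self._entries if a.topic.lower() in text]
```

The design choice that matters is the `reason` field: negative preferences need provenance, because suppressing a topic is itself a consequential act the system should be able to justify.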
Children close the false-belief-to-faux-pas gap through years of building exactly these person-specific avoidance models. They learn that Aunt Carol doesn’t want to hear about her ex-husband, that Dad gets tense when you mention money, that their friend will laugh off a joke about her cooking but not about her driving. None of this is reasoning. It’s stored social context, retrieved before speaking.
Agents can shortcut the developmental timeline — not with more parameters, but with an explicit pre-output consultation step that checks the irritation memory before the response ships.
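The consultation step itself is small. A self-contained sketch, with the aversion store reduced to a plain set of topic strings and `regenerate` standing in for a second model call with an explicit avoid-list constraint (all names here are hypothetical):

```python
def consult_before_output(draft, aversions, regenerate):
    """Illustrative pre-output gate: before a response ships, check it
    against the user's stored aversions and regenerate if it touches one."""
    hits = {t for t in aversions if t.lower() in draft.lower()}
    if not hits:
        return draft
    # regenerate() stands in for calling the model again with an
    # explicit "do not mention these topics" constraint.
    return regenerate(draft, hits)

# Usage sketch: the support-chatbot example from above.
aversions = {"previous returns", "layoff"}
safe = consult_before_output(
    "I see you've made three previous returns. How can I help?",
    aversions,
    lambda draft, hits: "How can I help with your new purchase?",
)
```

The gate runs after generation, not during it — which is the point: the model’s reasoning is untouched, and only the decision to ship is conditioned on stored context.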
Three limitations, ordered by severity.
The cold-start problem is harder for things people won’t say. A child builds avoidance models over years with the same people. An agent meeting a user for the first time has no irritation memory to consult. Users rarely volunteer what they don’t want to discuss — that’s the nature of the information. For first interactions, the fallback is population-level defaults, but population-level defaults are precisely what faux-pas detection is supposed to surpass.
Absence of signal is ambiguous. A user who consistently asks about Python is signaling a positive preference. A user who never mentions a topic might be avoiding it — or might simply not have needed it. Building reliable irritation memory from interaction patterns requires distinguishing deliberate avoidance from mere absence, and that inference is fragile in ways that positive-preference tracking is not.
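One conservative heuristic for this fragile inference, sketched under assumptions (the event schema is invented for illustration): treat mere absence as uninformative, and infer avoidance only when the agent raised a topic and the user explicitly deflected it.

```python
def infer_avoidance(events):
    """Infer avoided topics from an interaction log. events is a list of
    (actor, action, topic) tuples, e.g. ("agent", "raised", "work") or
    ("user", "deflected", "work"). A topic the user never mentions is
    NOT counted; only a deflection after the agent raised it is."""
    raised = {t for actor, action, t in events
              if actor == "agent" and action == "raised"}
    deflected = {t for actor, action, t in events
                 if actor == "user" and action in ("deflected", "objected")}
    # Avoidance requires both signals to co-occur on the same topic.
    return raised & deflected
```

This errs on the side of missing real aversions rather than inventing them — the safer direction, since a falsely inferred aversion silently withholds information the user may actually want.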
The safety-training tension may resist tuning. Strachan’s finding suggests that the mechanisms preventing hallucination share parameter space with the mechanisms that would enable faux-pas detection. If you relax factuality constraints enough to let the model speculate about emotional states, you may reintroduce the hallucination patterns those constraints exist to prevent. The two failure modes may be genuinely difficult to optimize simultaneously, not just underexplored.
The gap between a six-year-old who can track false beliefs and an eleven-year-old who can detect faux pas isn’t filled by better reasoning. The six-year-old already has the reasoning. What the eleven-year-old has is five years of accumulated social memory — a dense, person-specific map of sensitivities, irritations, and unspoken boundaries, consulted automatically before speaking.
The most capable language model in history can decode your irony, parse your hints, infer your false beliefs, and outperform you on strange stories. It just can’t tell when you’d rather it kept quiet. Not because it can’t think well enough — but because nobody gave it a place to store the things you’d rather not hear about, and a reason to check before opening its mouth.
The fix isn’t more reasoning. It’s giving the agent a place to remember — and a reason to check.
The essay argues agents need a stored, queryable record of what each user finds aversive, consulted before every output. That’s a provenance problem: you need a persistent, auditable chain where every interaction signal — what was said, what was flagged, what was avoided — is recorded and retrievable before the next response ships. Chain of Consciousness builds exactly this infrastructure. Every action, every signal, every consultation step is timestamped, anchored, and queryable — the persistent memory layer the “irritation memory” concept requires.
See how persistent signal chains work · pip install chain-of-consciousness · npm install chain-of-consciousness