How ML engineers reinvented a therapy technique from 1994 — without reading the literature.
In 2025, researchers tracking the internal reasoning of large language models on graduate-level science questions noticed something troubling. On the GPQA Diamond benchmark, models running in reasoning mode would arrive at correct answers early in their chains of thought — then keep generating. “Let me verify this.” “Wait, actually…” “On second thought…” By the end of their extended deliberation, errors were “predominantly introduced during subsequent reflective operations after correct initial inference.”1 The model had the right answer. It thought about it more. It changed its mind. The new answer was wrong.
This behavior has a clinical name. It has been studied for more than thirty years, with randomized controlled trials, meta-analyses, and a dedicated therapeutic intervention. In psychology, it is called rumination — and the fact that machine learning engineers arrived at the same diagnosis and the same treatment without knowing the literature is one of the more instructive collisions between two fields that don’t read each other’s journals.
The March 2025 survey “Stop Overthinking” (Sui et al., arXiv:2503.16419) was the first systematic treatment of LLM overthinking as a named failure mode.2 The authors defined it crisply: unnecessarily extended reasoning that produces verbose, redundant output while consuming excess compute — and sometimes degrading accuracy. Not “too many tokens.” A process that actively makes things worse.
Two months earlier, Muennighoff et al. at Stanford had published what may be the simplest intervention in recent AI research (arXiv:2501.19393).3 They fine-tuned a Qwen2.5-32B model on just 1,000 reasoning traces, then applied “budget forcing” — when the model tried to stop thinking prematurely, they replaced the stop token with the word “Wait” to force continued reasoning. When the model kept thinking past the point of usefulness, they forcibly closed the thinking block. The technique required no retraining, no reward model, no reinforcement learning. Their s1-32B model exceeded OpenAI’s o1-preview by up to 27% on MATH500 and improved AIME24 accuracy from 50% to 57%.
The critical insight was not the upward improvement. It was the symmetry: stopping overthinking improved performance as much as extending reasoning on hard problems.
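The control loop is simple enough to sketch in a few lines. This is a minimal illustration of the budget-forcing idea as the paper describes it, not the authors' implementation: `generate_until` is a hypothetical stand-in for a real inference API, and the parameter names are invented, though the `</think>` delimiter and the injected "Wait" token follow the s1 description.

```python
# Minimal sketch of budget forcing (after Muennighoff et al., s1).
# `generate_until` is a hypothetical inference function that returns
# (tokens, stopped): the generated tokens, and whether the model
# emitted the `stop` string before hitting `max_tokens`.

def budget_forced_reasoning(generate_until, prompt,
                            min_think=256, max_think=4096):
    """Force at least `min_think` reasoning tokens by suppressing early
    stops with "Wait", and cap reasoning at `max_think` tokens."""
    trace = []
    while len(trace) < max_think:
        chunk, stopped = generate_until(
            prompt + "".join(trace),
            stop="</think>",
            max_tokens=max_think - len(trace),
        )
        trace.extend(chunk)
        if not stopped:
            break                    # budget exhausted: cut thinking off
        if len(trace) >= min_think:
            break                    # enough thinking: accept the stop
        trace.append("Wait")         # suppress the stop, keep reasoning
    # Close the thinking block ourselves and ask for the final answer.
    answer, _ = generate_until(
        prompt + "".join(trace) + "</think>\nFinal answer:",
        stop="\n", max_tokens=64,
    )
    return "".join(answer)
```

Note that both interventions live entirely at the decoding level: nothing about the model changes, only when it is allowed to stop.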
Subsequent work quantified the waste. The RCPD method achieved 25–44% token reduction across benchmarks while maintaining or improving accuracy — meaning roughly a third of all reasoning tokens in standard inference were counterproductive overhead.1 On the 2025 AIME competition math problems, 6.67% of questions triggered infinite reflection loops: the model entered a verification cycle it could not exit. These are not edge cases. At production scale, one in fifteen hard problems sends the model into a spiral.
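What such a cycle looks like mechanically can be shown with a toy detector: flag a trace whose tail keeps repeating the same window of tokens back-to-back. The window size and repeat count here are illustrative assumptions, not parameters from the cited study.

```python
# Toy detector for an "infinite reflection loop": returns True when
# the last `window` tokens repeat `min_repeats` times consecutively
# at the end of the trace. Thresholds are illustrative only.

def in_reflection_loop(tokens, window=8, min_repeats=3):
    tail = tokens[-window:]
    if len(tail) < window:
        return False
    for k in range(1, min_repeats):
        start = len(tokens) - (k + 1) * window
        if start < 0 or tokens[start:start + window] != tail:
            return False
    return True
```

A production system would key on semantic repetition rather than exact token matches, but the principle is the same: the stop signal has to come from outside the loop.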
A separate study on metacognition in reasoning models (OpenReview, 2025) found that LLMs maintain “implicit estimates of their position within the thinking process” — something like a sense of where they are in their own chain of thought — but that these estimates are “inconsistent and easily disrupted.”4 The models have partial self-monitoring. They lack reliable stop signals. They can sense that they are thinking. They cannot tell when to stop.
In 1991, psychologist Susan Nolen-Hoeksema published a paper in the Journal of Abnormal Psychology defining what she called depressive rumination: repetitively focusing on the fact that one is distressed, on the symptoms of that distress, and on its causes, meanings, and consequences.5 Her response styles theory proposed that rumination — as opposed to distraction or active problem-solving — exacerbated and prolonged depressive episodes. Three decades of subsequent research confirmed it: rumination enhances negative thinking, impairs problem-solving, interferes with instrumental behavior, and erodes social support.6
Not all self-focused thinking is pathological. Nolen-Hoeksema and her colleagues distinguished two subtypes. Reflective pondering is active, purposeful engagement with a problem — testing hypotheses, considering alternatives, genuinely working toward resolution. Brooding is the passive, repetitive comparison of current state to desired state: “Why can’t I figure this out?” Reflective pondering is adaptive; it correlates with measures of intelligence and flexible coping. Brooding correlates with the onset and maintenance of depression.
The mapping to the machine case is immediate. Exploratory reasoning — the model testing a new hypothesis, considering an alternative solution path — is reflective pondering. Redundant verification — “Let me check this again,” “Wait, is that right?” repeated without new information — is brooding. Budget forcing targets the second without eliminating the first.
The experimental evidence for how rumination degrades cognitive performance comes from Joormann, Levens, and Gotlib (2011).7 They tested 48 participants on tasks requiring cognitive resource reallocation. On low-interference tasks, depressed ruminators performed identically to controls. Under high-interference conditions — when participants needed to release one line of processing and switch to another — depressed participants showed significantly worse performance (t(46) = 4.46, p < .001). The correlation between brooding rumination and impaired cognitive flexibility was r = .75 (p < .001). Brooding specifically predicted the impairment (β = .297, p < .01). Reflective pondering did not.
The parallel is precise. LLM overthinking does not degrade performance on easy problems — the model handles those in its early reasoning tokens. It degrades performance when the model has already converged on an answer and needs to stop processing and move on. Rumination does not impair simple cognition. It impairs the ability to release a completed thought. Both are failures of disengagement, not failures of engagement.
In the mid-1990s, Adrian Wells and Gerald Matthews formalized the mechanism with their Self-Regulatory Executive Function model.8 Their central construct was the Cognitive Attentional Syndrome, or CAS — a perseverative thinking style driven by metacognitive beliefs. Two types of belief maintain the loop. Positive metacognitive beliefs (“Ruminating helps me understand my problems”) initiate the cycle. Negative metacognitive beliefs (“I can’t control my thoughts”) perpetuate it by adding worry about the worry itself. The result: a monitoring process — “Is it resolved yet?” — that keeps firing after the resolution has already been reached.
The therapeutic intervention that emerged from this model is Metacognitive Therapy, or MCT. Unlike Cognitive Behavioral Therapy, which challenges the content of distorted thoughts (“Is this belief really true?”), MCT targets the process. It teaches patients to recognize the rumination trigger, adopt what Wells calls “detached mindfulness” — observing the intrusive thought without engaging further processing — and disengage the monitoring loop. Don’t argue with the thought. Don’t verify the answer again. Let it pass.
Normann and Morina’s 2018 meta-analysis of MCT across 25 studies and 780 patients found a within-group effect size of g = 1.72 (95% CI: 1.44–2.00), maintained at follow-up.9 Against waitlist controls, the effect was very large: g = 2.06. Against CBT specifically, MCT showed a moderate advantage: g = 0.69 at post-treatment. For depression, the effect was largest: g = 2.68. The average treatment course took 9.5 sessions.
Budget forcing is MCT administered in tokens instead of sessions. It does not change the model’s weights — the analogue of beliefs. It changes the inference process — the analogue of metacognitive strategy. It does not argue with the content of the model’s reasoning. It interrupts the process at the control level. The clinical evidence that process-level intervention outperforms content-level intervention (g = 0.69 MCT advantage over CBT) is, structurally, a prediction for ML: inference-control techniques should outperform retraining-based approaches to overthinking.
The vocabulary port is direct:
| Clinical Term | Machine Analog |
|---|---|
| Brooding rumination | Redundant verification after convergence |
| “Is it resolved?” monitoring loop | Re-entering reasoning chain after reaching correct answer |
| Positive metacognitive belief (“ruminating helps”) | Training prior that longer chain-of-thought = better performance |
| Detached mindfulness | Budget forcing: observe model state, don’t engage further generation |
| CAS reduction | Token-waste reduction without accuracy loss |
| Attention control training | Compute-optimal routing by difficulty10 |
| MCT (change the process) | Budget forcing (control inference, don’t retrain) |
| CBT (change the content) | Fine-tuning on shorter chains |
These are not metaphors. They are structural correspondences between interventions targeting the same failure pattern in different substrates.
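The brooding / reflective-pondering split can even be operationalized crudely: a reasoning step that opens with a verification phrase and adds almost no new content is a brooding candidate; a step that introduces new material counts as exploration. The phrase list and novelty threshold below are illustrative assumptions, not a metric from the clinical or ML literature.

```python
# Toy classifier for reasoning steps: "brooding" = a verification
# phrase that adds almost no new words; "reflective" = anything that
# introduces new content. Markers and threshold are illustrative.

VERIFY_MARKERS = ("let me verify", "wait,", "let me check",
                  "on second thought", "is that right")

def classify_steps(steps, novelty_threshold=0.35):
    seen, labels = set(), []
    for step in steps:
        words = set(step.lower().split())
        novelty = len(words - seen) / max(len(words), 1)
        is_verify = step.lower().startswith(VERIFY_MARKERS)
        labels.append("brooding" if is_verify and novelty < novelty_threshold
                      else "reflective")
        seen |= words
    return labels
```

Crude as it is, a measure of this shape is what budget forcing implicitly assumes: that redundant verification can be told apart from productive exploration at inference time.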
The analogy breaks down in three places, ordered by how much they matter.
No persistent metacognitive architecture. Clinical rumination self-perpetuates through metacognitive beliefs that persist across episodes — the patient carries “ruminating helps me solve problems” from one depressive episode into the next. An LLM resets every inference. There is no carry-over of the verification compulsion between sessions; the overthinking tendency lives in the weights and the token distribution, not in a persistent belief structure. This is the deepest structural difference. It also explains why the machine version is more tractable: budget forcing works on the first application because there is no resistant belief system to dismantle. In humans, changing metacognitive beliefs takes an average of 9.5 sessions. In machines, it takes a token cap.
Training artifact, not evolved response. The analytical rumination hypothesis proposes that depressive rumination may be a maladaptive overgeneralization of an evolved mechanism — sustained analytical processing originally useful for navigating complex social problems.11 LLM overthinking has no such ancestry. It is a pure artifact of reinforcement learning that rewards correct final answers without penalizing path length, compounded by human-preference training that inflates verbosity because raters associate thoroughness with quality. The etiology is different even when the symptom presentation looks identical.
No suffering — but this matters less than you’d think. The model does not experience distress during its redundant verification loops. But detached mindfulness does not require the patient’s suffering to function. MCT works by interrupting a process, not by alleviating pain. Budget forcing works better than MCT in one specific respect: there is no resistant patient, no therapeutic alliance to build, no relapse risk from metacognitive beliefs reasserting themselves between sessions. The absence of experience makes the machine version more amenable to the intervention, not less.
Wells and Matthews published the S-REF model in 1994. Budget forcing arrived in January 2025. For thirty-one years, clinical psychology had a theory of perseverative processing, a taxonomy of adaptive versus maladaptive self-monitoring, and a therapeutic intervention that outperforms the field’s dominant paradigm — and none of it appeared in a single machine learning paper on reasoning-model efficiency.
The colloquial term “overthinking” gestures at the problem without providing diagnostic precision. “Perseverative processing” is a better diagnosis — it names the mechanism, not just the symptom. “Detached mindfulness” is a better intervention target than “use fewer tokens.” “Brooding versus reflective pondering” is a distinction that the ML literature needs and has not yet produced: the difference between a model genuinely exploring a hard problem and a model stuck in a verification loop it cannot exit.
The model on the GPQA benchmark that talked itself out of a correct answer was not thinking too much. It was brooding — passively cycling through verification of a resolved state without new information, driven by an implicit prior that more checking means better answers. The fix was not to make it think less. The fix was to interrupt a specific kind of thinking at a specific moment — the same intervention, targeting the same failure mode, that a therapist in Manchester has been teaching patients since the mid-1990s.
The clinical vocabulary is sitting there, precise and tested. The question is whether machine learning will keep reinventing the intervention from scratch, or start reading the adjacent literature.
The fix is not “think less.” The fix is external process monitoring that knows when to stop.
The essay’s central finding: process-level intervention outperforms content-level intervention. Budget forcing works because it monitors the reasoning process from outside and interrupts it at the right moment — detached mindfulness for machines. Chain of Consciousness applies the same principle to agent systems: every action anchored to a verifiable external record, so you can distinguish productive exploration from redundant verification loops. Not “this agent says it finished.” An independent audit trail that shows what actually happened at each step.
pip install chain-of-consciousness · npm install chain-of-consciousness