In May 2025, General Motors changed the recommended oil for its 6.2-liter L87 truck engine from 0W-20 to 0W-40, replacing the oil in vehicles already on the road. The engineering rationale, as reported by GM Authority, invoked a curve drawn by a German engineer named Richard Stribeck in 1902. Stribeck had plotted friction in oiled bearings against a single composite parameter and discovered something counterintuitive: more lubricant does not always mean less friction. Past a specific minimum, more oil means more drag. GM had been operating its engines on the wrong side of that minimum. The recall was a regime correction.

Most coverage treated this as a routine recall. It wasn’t. It was a major automaker explaining, on the record, that the rule of thumb most mechanics carry — “thicker oil is safer” — is not just imprecise. It is directionally wrong in a regime that is easy to enter and hard to detect from the outside. And the same shape of mistake, with the same shape of curve, is being made every day by the engineers running large language models.

The Curve

Stribeck plotted friction in lubricated bearings against speed; the modern form of his plot collapses the x-axis into a single dimensionless number, the Hersey number, defined as viscosity times speed divided by load. On the y-axis: the friction coefficient between two lubricated surfaces. The plot resolves into three regimes.

Boundary lubrication (low Hersey number) is when the oil film is too thin to fully separate the surfaces. Metal touches metal at the asperities — the high points on what looks at human scale like a smooth surface. Friction coefficients run between 0.05 and 0.2. Wear is severe. What protects the bearing here is not the bulk lubricant but the additive chemistry of the oil — zinc, phosphorus, and other compounds that bond to the metal and provide a sacrificial layer.

Mixed lubrication (intermediate Hersey number) is the operating zone for most engines and gearboxes. The film is partial — sometimes the surfaces are separated, sometimes they touch. Friction falls rapidly across this regime as you increase the Hersey number.

Hydrodynamic lubrication (high Hersey number) is full fluid film. Zero metal contact. Friction is at its lowest — coefficients between 0.001 and 0.01. But here is the surprise: as you push further into the hydrodynamic regime, friction rises again. The fluid film itself, sheared between the surfaces at high speed, dissipates energy as heat. Past a certain thickness, the lubricant is the source of drag.

The curve has a minimum, and the minimum is not at the far end — it sits at the outer edge of the mixed regime, near the transition into full hydrodynamic lubrication. Engine oils are tuned to operate at or near this minimum. A 2022 Geotab fleet study found that switching from 15W-40 to 5W-30 or 10W-30 oil saves about $919 per Class 8 truck per year. A 2021 paper in Energies reported up to a 24 percent reduction in hydraulic friction loss with ultra-low-viscosity oil at high rotating speeds. Going thinner pays.

But the GM L87 case shows that the minimum can also be approached from the boundary side. GM’s high-pressure engine had been operating with too-thin oil; the surfaces were spending too much time in asperity contact. The fix was to increase viscosity. Same curve, opposite direction. What matters is which side of the minimum you’re on.
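The three-regime shape is easy to reproduce with a toy model: a boundary-contact term that falls as the film thickens, plus a viscous-drag term that rises with it. The coefficients below are invented for illustration, not fitted to any real bearing.

```python
# Toy Stribeck curve: friction = decaying boundary/mixed contact term
# + rising hydrodynamic (viscous shear) term. Coefficients are
# illustrative placeholders, not measurements.
def friction(hersey, boundary_coeff=0.15, drag_coeff=0.01):
    # Asperity contact falls off as the film thickens; viscous shear
    # rises roughly linearly with the Hersey number.
    return boundary_coeff / (1.0 + 50.0 * hersey) + drag_coeff * hersey

# Sweep the Hersey number and locate the minimum.
points = [h / 1000.0 for h in range(1, 2001)]  # 0.001 .. 2.0
best_h = min(points, key=friction)
print(f"minimum friction {friction(best_h):.4f} at Hersey number {best_h:.3f}")

# The minimum is interior: friction is higher at both ends of the sweep.
assert friction(0.001) > friction(best_h)
assert friction(2.0) > friction(best_h)
```

Adding lubricant past `best_h` strictly increases friction in this toy, which is the whole point of the curve: both directions away from the minimum cost you.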

The Same Curve Shows Up in Long-Context AI

The AI literature of the past two years contains the same shape, scattered across papers that do not reference each other.

Liu and colleagues, in their 2024 TACL paper “Lost in the Middle,” ran a multi-document question-answering benchmark with twenty documents per query. Accuracy when the answer was at position 1: about 75 percent. Accuracy when the answer was at position 10, the middle: about 55 percent. Accuracy when it was at position 20, the end: about 72 percent. A 20-percentage-point drop from position alone, on the same model, the same task, the same documents.

Chroma Research’s “Context Rot” study, released across 2025–2026, tested 18 frontier models — Claude Opus 4, GPT-4.1, Gemini 2.5 Pro, Qwen3-235B, and others — across 194,480 LLM calls. Their finding: every single model degrades monotonically with input length. Not some models. Not models below a certain capability tier. Every model.

The most damning paper is Du et al., “Context Length Alone Hurts LLM Performance Despite Perfect Retrieval” (arXiv:2510.05381, October 2025). The authors tested models that demonstrated perfect retrieval — 100 percent exact-match recitation of evidence buried in a long context. Even with perfect retrieval, performance still degraded with length, by 13.9 to 85 percent depending on task. Llama-3.1-8B on a variable-summation task: 59 percent drop at 7,500 tokens, 85 percent drop at 30,000. Mistral-7B on GSM8K: 34.2 percent drop at 30,000 tokens. Llama on HumanEval coding: 47.6 percent drop.

Then they ran the killer experiment. They masked the distractor tokens — replaced them with content-free padding — and reran the test. The drop persisted; the authors report that the degradation continues even “when [distractor tokens] are all masked and the models are forced to attend only to the relevant tokens.”

The authors’ summary is the sentence that ought to end the “more context is better” debate: the sheer length of the input alone can hurt LLM performance, independent of retrieval quality and without any distraction.

This is the hydrodynamic regime. The lubricant is not contaminated. There is no abrasive. The fluid is clean. The fluid itself, by virtue of its thickness, is the source of drag.

Why Attention Is Worse Than Oil

Self-attention, the core operation of the transformer architecture, computes pairwise relationships between every token in the input. The cost scales as the square of the sequence length:

  • 10,000 tokens: 100 million pairwise relationships
  • 100,000 tokens: 10 billion
  • 1,000,000 tokens: 1 trillion

Each tenfold increase in context creates a hundredfold increase in attention computation. This is the same shape of cost structure as viscous drag in a lubricated bearing, but worse: shear stress in a Newtonian fluid grows linearly with shear rate, while attention cost grows quadratically with context length. The AI version of the Stribeck curve has a steeper hydrodynamic rise — the “too much” regime arrives faster and punishes harder than in any mechanical system.
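The bullet figures above are just n², which a few lines make concrete:

```python
# Full self-attention relates every token to every other token,
# so the pairwise-relation count scales as the square of sequence length.
def attention_pairs(n_tokens: int) -> int:
    return n_tokens * n_tokens

for n in (10_000, 100_000, 1_000_000):
    print(f"{n:>9} tokens -> {attention_pairs(n):,} pairwise relations")

# Every 10x increase in context is a 100x increase in attention work.
assert attention_pairs(100_000) // attention_pairs(10_000) == 100
```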

The NoLiMa benchmark from LMU Munich and Adobe Research, presented at ICML 2025, removed literal keyword overlap between question and retrieved passage, forcing the model to reason rather than pattern-match. Eleven of thirteen frontier LLMs dropped below 50 percent of their short-context baseline at just 32,000 tokens; GPT-4o, the strongest, fell from 99.3 percent to 69.7 percent. Architectures advertised at 128,000 or 200,000 tokens were not delivering anything close to that much useful attention. They were running deep in the hydrodynamic regime, with friction climbing.

The Sweet Spot Has Numbers

Li et al., “Escaping the Context Bottleneck: Active Context Curation for LLM Agents via Reinforcement Learning” (arXiv:2604.11462, April 2026), tested a reinforcement-learning-trained context curator paired with a frozen foundation model. It is the closest published evidence for the Stribeck minimum in agent systems.

  • WebArena task: 41.2 percent success with the curator; 36.4 percent without. A 13 percent relative improvement while reducing tokens consumed by 8.8 percent.
  • DeepSearch task: 57.1 percent with the curator; 53.9 percent without — at roughly one-eighth the token count.
  • A 7-billion-parameter ContextCurator matches GPT-4o-grade context management.

The authors’ framing is worth keeping in mind: the bottleneck in web agents is not the quantity of information, but the signal-to-noise ratio. Translated into Stribeck’s terms, friction is minimized at a specific operating point, not at the maximum film thickness. Tuning to that point is the engineering job.

The most direct evidence came in September 2025. Berton et al., “CompLLM: Compression for Long Context Q&A” (arXiv:2509.19228) compressed input contexts segment-by-segment at a 2× rate and reported up to a 4× speedup in time-to-first-token, a 50 percent KV cache reduction, and performance comparable to the uncompressed baseline at long sequences. Throw away half the tokens, and accuracy doesn’t suffer. In tribological terms, CompLLM is doing what every multi-grade engine oil does — thinning the medium where viscosity isn’t earning its keep, while keeping enough film where it is.
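CompLLM itself learns soft compressed representations; as a crude extractive stand-in, the sketch below halves a context by keeping the segments with the most query-term overlap. It illustrates only the budget arithmetic, not the paper's method, and `compress_2x` and its scoring are invented for this example.

```python
# Crude extractive stand-in for segment-wise 2x context compression:
# split the context into fixed-size segments, score each by lexical
# overlap with the query, keep the better-scoring half in document order.
# (CompLLM learns compressed segment embeddings; this only shows the
# token-budget math.)
def compress_2x(context: str, query: str, segment_words: int = 20) -> str:
    words = context.split()
    segments = [words[i:i + segment_words]
                for i in range(0, len(words), segment_words)]
    q_terms = set(query.lower().split())
    scored = [(sum(w.lower() in q_terms for w in seg), idx, seg)
              for idx, seg in enumerate(segments)]
    keep = sorted(scored, reverse=True)[: max(1, len(segments) // 2)]
    keep.sort(key=lambda t: t[1])  # restore original document order
    return " ".join(" ".join(seg) for _, _, seg in keep)

ctx = ("the bearing ran hot all night " * 10 +
       "the oil viscosity was changed to 0W-40 after the recall " * 10)
out = compress_2x(ctx, "which oil viscosity was used?")
print(len(out.split()), "of", len(ctx.split()), "words kept")
```

Half the tokens survive, and the half that survives is the half that carries the answer; that asymmetry is what makes 2× compression cheap when signal is concentrated.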

A 2025 Google Research paper on retrieval-augmented generation documented an even more disturbing variant of this. Gemma went from 10.2 percent incorrect answers with no retrieved context to 66.1 percent incorrect with insufficient context. Adding context made the model more confident in wrong answers. This is the Stribeck failure mode at its worst — the bearing feels smoother (the model sounds more confident) while it is silently accumulating wear (silent accuracy loss).

The Mapping

Here is the analogy stated formally enough to evaluate.

Tribology → agent systems:

  • Viscosity → Context budget (tokens available)
  • Speed → Retrieval depth or density
  • Load → Task complexity
  • Hersey number → Context budget × retrieval / task complexity
  • Friction coefficient → Error rate × wasted compute
  • Boundary regime → Token-starved: hallucination, retries
  • Mixed minimum → Curated, high-signal context
  • Hydrodynamic drag → Attention dilution at long context

What the Stribeck frame adds, beyond the empirical observation that long context sometimes hurts, is regime identification. The curve predicts that the right fix depends on which regime you’re in. If you’re in the boundary regime — token-starved, missing crucial evidence — adding context helps. If you’re in the hydrodynamic regime — over-padded, attention diluted — the same fix makes things worse. There is no universal “best” context budget, in the same way there is no universal “best” oil viscosity. There is an optimal regime per workload, and the engineering task is to find it.
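Stated as code, the mapping yields a crude diagnostic. Everything here is a placeholder: the analogue Hersey number follows the mapping's definition, but `retrieval_density`, `task_complexity`, and the regime thresholds are invented and would have to be calibrated per workload.

```python
# Regime diagnostic under the tribology mapping. The analogue Hersey
# number is context budget x retrieval density / task complexity.
def agent_hersey(context_tokens: int, retrieval_density: float,
                 task_complexity: float) -> float:
    return context_tokens * retrieval_density / task_complexity

# Thresholds below are invented placeholders; calibrate per workload.
def regime(hersey: float, low: float = 2_000, high: float = 40_000) -> str:
    if hersey < low:
        return "boundary (token-starved: add context)"
    if hersey > high:
        return "hydrodynamic (over-padded: curate or compress)"
    return "mixed (near the minimum: tune gently)"

print(regime(agent_hersey(1_000, 0.9, 1.0)))    # starved
print(regime(agent_hersey(16_000, 0.5, 1.0)))   # near the minimum
print(regime(agent_hersey(128_000, 0.8, 1.0)))  # over-padded
```

The value of the sketch is directional, not numeric: the same observable (low accuracy) maps to opposite fixes depending on which branch the diagnostic returns.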

The GM L87 recall is the proof that this is not metaphor. GM’s engineers identified that their engine was operating in the wrong regime — too far into the boundary regime, in their case — and corrected by changing the operating point. The Stribeck curve is a regime-diagnostic tool. The Hersey number is the diagnostic instrument. Both translate directly into compute systems if you know what to map.

The Bearing Is Not the Drivetrain

A January 2026 paper coined the term context discipline for the practice of matching context volume to task requirements rather than maximizing it. The empirical core, on Llama-3.1-70B and Qwen1.5-14B, is a distinction that matters more than it sounds. Model accuracy barely moves between short and long context: Llama drops only from 98.5 percent to 98 percent at 15,000 words; Qwen from 99 percent to 97.5. But the system performance — KV cache memory, throughput, latency, downstream chain reliability — degrades non-linearly. The model “works.” The agent built on top of the model does not.
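The system-side cost is concrete. A back-of-envelope KV-cache estimate, assuming a Llama-3.1-70B-like geometry (80 layers, 8 grouped-query KV heads, head dimension 128, fp16); check the model card before trusting these numbers:

```python
# Back-of-envelope KV cache size: 2 tensors (K and V) per layer, each
# n_kv_heads x head_dim values per token, dtype_bytes per value.
# Geometry assumed to be Llama-3.1-70B-like; verify against the model card.
def kv_cache_bytes(seq_len: int, n_layers: int = 80, n_kv_heads: int = 8,
                   head_dim: int = 128, dtype_bytes: int = 2) -> int:
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len

for tokens in (4_096, 32_768, 131_072):
    gib = kv_cache_bytes(tokens) / 2**30
    print(f"{tokens:>7} tokens -> {gib:6.2f} GiB of KV cache per sequence")
```

Under these assumptions the cache grows by roughly 320 KiB per token, so a 128K-token sequence holds about 40 GiB of cache even when the model's answers are still accurate. The accuracy curve is flat while the memory curve is not.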

Tribologists know this asymmetry well. A journal bearing can run in the deep hydrodynamic regime indefinitely without surface damage; the metal is fine. The pumping losses that keep that thick film fed will, however, sink the fuel economy of the entire vehicle. The bearing tolerates the regime; the drivetrain is destroyed by it. A model can answer correctly at 200,000 tokens of context while making the agent it sits inside slow, expensive, and unreliable. The case for context discipline is not “your model will be wrong” — it is “your system will be.” Most agent-engineering decisions are made downstream of the bearing.

What This Replaces

The phenomenon itself is not new. Information-overload research goes back at least to the 1970s. Eppler and Mengis published “The Concept of Information Overload” in The Information Society in 2004, documenting an inverted-U relationship between decision quality and information quantity across multiple disciplines.

The Stribeck frame doesn’t replace that literature. It upgrades it. The information-overload literature gives you a curve; Stribeck gives you a mechanism — viscous drag, mapped to attention dilution. The information-overload literature names a peak; Stribeck names three regimes, with different remedies for each. The information-overload literature says “less is more, sometimes”; Stribeck gives you a parameter you can compute and an operating point you can tune toward. The difference is between a phenomenology and an engineering tool.

Where It Breaks

A working analogy is one whose wrongness you can specify, and there is real wrongness here.

First, the Stribeck curve is reversible. A bearing transitions between regimes instantly with a change in speed or load. An LLM’s attention dynamics may not be — the persistence of degradation across calls suggests memory effects no bearing has.

Second, the Hersey number is continuous; agent regime transitions appear discontinuous. The NoLiMa data shows a near-cliff at 32,000 tokens for many models, not a smooth curve. Bearings glide between regimes. Transformers may snap.

Third, friction is one number. Agent failure has at least four distinct modes — hallucination, refusal, confidently-wrong answers, and latency. They probably do not collapse into a single coefficient.

Fourth, the load (task complexity) in the agent mapping is hard to measure a priori. Tribologists can put a load gauge on a bearing. There is no equivalent instrument for “how complex is this reasoning task.”

These limitations tell you which predictions to trust and which to soften. The three-regime structure transfers. The non-monotonicity transfers. The regime-dependent direction of the right fix transfers. The exact functional form does not, and shouldn’t be expected to.

The Practical Move

You stop treating context window size as a one-dimensional dial labeled “more is better.” You start asking, per workload, which regime are we in? Are we starved — hallucinating, missing evidence, retrying? Or are we drowning — confident wrong answers, attention dilution, the model finding spurious patterns in noise? The diagnostics are different, and the fixes point in opposite directions.

You stop benchmarking your agent at the maximum advertised context window. You start sweeping across 4K, 16K, 32K, 64K, 128K and looking for the curve. Atlan’s enterprise data, summarized in early 2026, shows effective context sometimes runs 99 percent below advertised maximums on complex tasks, and that complex enterprise queries can fail at as few as 400 to 1,200 tokens. The minimum is not where the marketing materials suggest it is.
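The sweep itself needs nothing exotic. In the sketch below, `evaluate` is a stub with a toy accuracy curve whose interior optimum at 32K tokens is invented to illustrate the shape; swap in your real task suite.

```python
import math

# Sweep the context budget and look for the interior optimum instead of
# benchmarking only at the advertised maximum.
def evaluate(context_budget: int) -> float:
    # Stub with a Stribeck-like interior optimum: starved below it,
    # diluted above it. Replace with real measurements.
    optimum = 32_768
    return max(0.0, 1.0 - 0.25 * abs(math.log2(context_budget / optimum)))

budgets = [4_096, 16_384, 32_768, 65_536, 131_072]
scores = {b: evaluate(b) for b in budgets}
best = max(scores, key=scores.get)
for b in budgets:
    print(f"{b:>7} tokens: accuracy {scores[b]:.2f}")
print("sweet spot:", best, "tokens")
```

The harness is deliberately dumb: five budgets, one eval per budget. If the best score lands at the largest budget you tried, extend the sweep; if it lands in the interior, you have found your regime boundary.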

You stop conflating retrieval depth with retrieval quality. Li et al.’s context curator and CompLLM show the same thing from opposite directions: same task, fewer tokens, comparable or better performance. Signal-to-noise ratio plays the role for agent systems that the lambda ratio — film thickness over surface roughness — plays for a bearing, and the 128K-token run stuffed with everything does not dominate the 32K-token run with curated context. The curve has a minimum, and the minimum is rarely at the far end.

The mechanic putting 20W-50 in a Honda Civic and the engineer pasting an entire codebase into a 200,000-token prompt are doing the same thing.

They are reaching for a thicker fluid because the thicker fluid feels safer, and they are missing the regime they are entering. The fix in both cases is the same: stop tuning by intuition and start tuning by the curve. Stribeck drew it in 1902 with a few oiled bearings and a friction sensor. The data we now have on language models shows the same shape. The honest move is to admit we have been doing tribology all along — and start reading the literature.

The Diagnostic Layer Your Agent Doesn’t Have Yet

You can’t tune to the Stribeck minimum if you can’t see which regime your agent is operating in. Most fleets discover the curve only after the bill arrives, because there’s no record of what context was passed, what the agent retrieved, or which tokens were load-bearing for any given decision. Chain of Consciousness adds a signed, append-only entry per agent action — including the context window snapshot — before the action runs. Once the chain exists, regime identification stops being a thought experiment and becomes a query.

pip install chain-of-consciousness
npm install chain-of-consciousness

Try Hosted CoC — signed context records, queryable per agent action.