Fretting Wear in Continuous Integration: When Coupled Systems Fail Without Visible Symptoms

A bolted aircraft skin panel does not loosen because anyone hits it. It loosens because the airframe vibrates a few microns per cycle for fifty million cycles, and every one of those cycles is nominally beneath any threshold an inspector would write down. The bolt holds full clamping force. The fastener torque audit passes. The skin passes visual inspection. Then one day a hairline crack walks out from under the bolt head and the panel is condemned — and the post-mortem shows red iron oxide caked into the contact patch like rust dust mixed with chocolate. Tribologists call this fretting wear, and the brown-red debris is called “cocoa.” It is the canonical failure mode of clamped components that vibrate against each other at amplitudes too small for any single inspection to detect.

The same failure mode runs in software, and almost nobody calls it by that name.


The Mechanism, Briefly

Fretting is small-amplitude oscillatory sliding under normal load, typically 10 to 300 micrometres of relative motion per cycle. The combination of “clamped” and “vibrating” is what makes the mechanism uniquely destructive. The clamp prevents the wear debris from escaping. The vibration generates fresh debris on every cycle. The trapped debris then participates in subsequent cycles as both an abrasive grit and a stress concentrator, and on steel it oxidises into a chemically active red oxide that itself becomes a fresh abrasive.

What makes fretting taxonomically interesting is that it is, simultaneously, all four of Rabinowicz’s canonical wear modes — adhesive welding at the stick patches, abrasive ploughing by entrained debris, fatigue crack initiation at the wear scar (which cuts component fatigue life by a factor of three to ten), and tribochemical reaction in the form of the cocoa itself. The standard reference on the subject — Hutchings & Shipway’s Tribology: Friction and Wear of Engineering Materials (2017, Chapter 8), supported by ASM Handbook Volume 18 — emphasises a brutal fact: fretting survives undiagnosed in the majority of high-cycle machines because the component-level wear tests engineers run in isolated geometries systematically understate system-level wear by orders of magnitude. The test rig measures one mechanism on a clean specimen. The field measures all four mechanisms operating on a contaminated, debris-loaded interface that has accumulated state for months.

This last sentence is also a perfect description of how production software systems fail.

The CI/CD Pipeline as a Fretting Regime

Continuous integration pipelines impose exactly the regime fretting requires. A modern team ships ten to a hundred small changes per day. Each change is a low-amplitude perturbation of a system that is otherwise clamped — clamped by SLA commitments, schema contracts, dependency pins, feature-flag defaults, deployment topologies that do not move. The system vibrates because the deltas are constant: feature-flag toggles, rollbacks, schema migrations forward and backward, dependency bumps, env-var hotpatches, traffic-shifting between regions, blue-green flips. Each individual change is well below the amplitude that any single test was written to detect. Each change passes its unit tests. Each passes integration. Each passes canary. Each passes staged rollout.

And yet failures happen, and the post-mortem keeps producing the same shape of story: a database migration three weeks ago left behind a column in a particular state. A feature flag was toggled twice last Tuesday. A dependency was bumped from 4.2.1 to 4.2.2 on a tangential service. A new env-var was added with a sensible default. Each of these was, individually, harmless. The crash today is the conjunction. It looks like the four wear modes are running against the codebase in parallel.

Adhesive: the coupling failure

Two services that were supposed to be loosely coupled have stuck together at an interface — a shared utility library, a shared cache key, a shared assumption about what an “active user” means. Every small motion peels material from one and welds it to the other. Eventually one of them tears off a piece of the other. This is the failure mode where the post-mortem reads “Service A was supposed to be independent of Service B, but it turns out Service A was relying on a side-effect of Service B’s deployment timing.”

Abrasive: the debris-as-grit failure

Old data, orphan rows, deprecated fields, half-migrated tables, dead feature flags that were not cleaned up, stale entries in retry queues — all of this is wear debris from prior changes. It does not cause problems on the day it is created. It causes problems three weeks later when a new code path encounters an unexpected shape and ploughs through it, scattering exceptions. The team patches the new code path. The debris remains. The next code path will encounter it too.
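A minimal sketch of this mechanism, using hypothetical field names (`plan`, `tier`) and invented rows: a new code path written against the current schema runs cleanly on a scrubbed fixture, then ploughs into a row left half-migrated by an earlier change.

```python
# Hypothetical illustration: 'plan' replaced the deprecated 'tier' field,
# but an unfinished migration left an old-shaped row behind.

def monthly_price(row: dict) -> int:
    """New code path, written against the current schema only."""
    prices = {"free": 0, "pro": 20}
    return prices[row["plan"]]  # assumes every row was migrated


clean_fixture = [{"id": 1, "plan": "free"}, {"id": 2, "plan": "pro"}]
aged_table = clean_fixture + [{"id": 3, "tier": "gold"}]  # debris row


def failing_ids(rows: list[dict]) -> list[int]:
    """Return the ids of rows the new code path cannot process."""
    failed = []
    for row in rows:
        try:
            monthly_price(row)
        except KeyError:
            failed.append(row["id"])
    return failed


print(failing_ids(clean_fixture))  # nothing fails on the scrubbed data
print(failing_ids(aged_table))     # the debris row surfaces here
```

Wrapping `monthly_price` in a handler for `KeyError` would be the patch described above: it quiets the new code path and leaves row 3 in place for the next one.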

Fatigue-initiating: the stress-concentrator failure

Each prior change has left a small notch in the system: a workaround, a special-case branch, a “TODO: handle this later,” a clause that swallows one specific exception class. These are not bugs; they are scars. Each scar is a stress concentrator for the next change. After enough cycles, a crack initiates not at the most-stressed point but at the most scarred one. This is the well-known phenomenon where the most complicated module is also the most-changed and the most-broken: the scars accumulate where the cycling is heaviest.

Tribochemical: the cocoa failure

In steel-on-steel fretting, the iron debris oxidises in air into ferric oxide — the famous reddish-brown powder — which is harder than the steel that produced it. The new oxide grit grinds away more steel, which oxidises into more grit. The system is now generating its own abrasive. In CI/CD, the equivalent is the way error-handling code, monitoring instrumentation, retry logic, and fallback paths all themselves become attack surface and failure modes. A retry loop introduced to handle an upstream flake becomes the thing that hammers a database into a slow query path. A circuit breaker introduced to protect a service becomes the reason a feature is mysteriously unavailable for a subset of users. A monitor introduced to catch a class of error becomes a memory leak under sustained load. The cocoa is the defensive instrumentation that gets contaminated, hardens, and then becomes the grit that wears down the next layer.
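The retry example can be made concrete with a little arithmetic. The sketch below assumes a hypothetical call chain in which every layer retries a failing call independently; the layer names and attempt counts are illustrative, not measurements.

```python
from math import prod


def worst_case_calls(attempts_per_layer: list[int]) -> int:
    """Worst-case number of calls reaching the bottom dependency when
    every layer retries independently and every attempt fails."""
    return prod(attempts_per_layer)


# Hypothetical stack: client -> gateway -> service -> database,
# with each of the three callers making up to 3 attempts.
print(worst_case_calls([3, 3, 3]))  # 27 database calls for one user request
```

Each retry policy looks reasonable when reviewed alone. The multiplication only appears when the layers are read together, which is the sense in which the defensive code becomes the grit.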

The thing that makes this the fretting analogy specifically — and not just the general “complex systems fail in complex ways” cliché — is that all four mechanisms run at the same time, on the same component, at the same low amplitude, under the same nominally-acceptable conditions. No single mechanism is responsible. No single change is responsible. The crash is a third-order interaction: debris from event A becomes a stress concentrator for event B, which initiates a crack in shared component C, which oxidises into a runtime panic that only manifests under specific concurrency pattern D — none of which had a test, because none of them existed at the time the tests were written.

Why the Test Rig Understates the Field

The single most important sentence in Hutchings & Shipway’s fretting chapter — and the one most relevant to software engineering — is the observation that component-level wear tests, run in clean isolated geometries, understate system-level wear by orders of magnitude. The pin-on-disc tribometer measures one mechanism on one surface pair, with the debris escaping the contact zone freely. It cannot reproduce the trapped-debris, multi-cycle, mixed-mode regime of an actual bolted joint in service. To get accurate numbers, you have to test the assembled system under the actual loading and environmental conditions, for the actual time scales involved, and accept that this is approximately a hundred times more expensive than the component test.

Software unit tests are pin-on-disc tribometers. They isolate one function in one configuration with all the debris (mocks, fixtures, in-memory databases) explicitly removed. They produce a clean wear coefficient — pass/fail — for the function under that single condition. Integration tests are slightly larger rigs that include a few neighbouring components but still scrub the environment between runs. End-to-end tests are larger still but typically reset state on every run, which is to say they remove the debris.

Production is the only environment that contains the historical residue of every change made for the entire lifetime of the system. The schema in production is not the schema in any test environment, because production has columns that nobody remembers adding, indexes that were created for a one-time backfill in 2023, foreign keys that point at a table that was renamed and then renamed back. The cache in production is not the cache in any test environment, because it has been warm for months and has stale entries written by services that have since been decommissioned. The traffic in production is not the traffic in any test environment, because it includes the long tail of clients with bizarre user-agents, broken retry loops, and version-pinned SDKs from three years ago.

This is the failure mode the cleanroom cannot see. The clean test rig under-predicts the dirty system by orders of magnitude — exactly as the tribology literature says it must.

The Diagnostic Asymmetry

The most useful thing tribologists do — and the thing software engineering has barely learned to do — is post-mortem morphology analysis on the wear scar. The classical move is to take the failed component, cross-section the wear region, polish it, mount it under an SEM, and read off the particle morphology of the debris. Equiaxed angular particles with a hardness signature higher than the parent metal indicate abrasive wear. Flake-shaped particles with rolled edges indicate adhesive transfer. A subsurface crack network with a characteristic angle indicates fatigue. Oxide layers of specific colours and thicknesses indicate tribochemical activity. From the morphology of what is left behind, you can reconstruct which of the four mechanisms dominated, when, and under what conditions.

Software engineering’s equivalent, the production post-mortem, typically examines only the last 24 hours of evidence and then patches the most proximate cause. The deeper morphology questions go unasked. How many of our recent incidents share a substrate of accumulated debris from migrations that were never fully cleaned up? How many of our shared utilities are scarred by stress-concentrating special cases that make them increasingly fragile to any subsequent change? How much of our defensive instrumentation has become the contaminant that propagates failure rather than catching it? The tribologist asks what the wear surface says about the regime that produced it. The site reliability engineer asks what the immediate cause of the page was. These are very different questions, and only the first one reveals the systemic mechanism.

What a Tribology of CI/CD Would Look Like

If a serious engineering organisation took the fretting analogy as a working hypothesis — that their CI/CD pipeline is producing low-amplitude cyclic wear on coupled subsystems, and that the dominant failure mode is a multi-mechanism third-order interaction — a small number of practices would change.

First, debris audits would become a routine maintenance activity rather than an emergency response. The candidates are easy to list: schema columns that no code reads, feature flags that have not been toggled in six months, dependency entries that no module imports, env-vars that are set but never read, monitors that have not fired in a year, retry handlers that catch exception classes nobody throws anymore. Every one of these is wear debris. The standard practice today is to leave it in place because removing it is “risky.” That is exactly the failure mode: the debris accumulates, and the system becomes increasingly fragile to the next change.
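A debris audit of this kind can be quite small. The sketch below is one possible shape, assuming a hypothetical flag inventory (flag name mapped to last-toggle time) and a set of env-var names extracted from the codebase by some other means. None of these structures come from a real tool.

```python
from datetime import datetime, timedelta


def stale_flags(last_toggled: dict[str, datetime], now: datetime,
                max_age: timedelta = timedelta(days=182)) -> list[str]:
    """Flags not toggled within max_age (roughly six months by default)."""
    return sorted(name for name, ts in last_toggled.items()
                  if now - ts > max_age)


def unread_env_vars(configured: set[str], read_by_code: set[str]) -> list[str]:
    """Env-vars set in deployment config but never read by any code."""
    return sorted(configured - read_by_code)


# Hypothetical inventory, for illustration only.
now = datetime(2025, 1, 1)
flags = {
    "new_checkout": datetime(2024, 12, 1),
    "legacy_banner": datetime(2023, 3, 1),
}
print(stale_flags(flags, now))                                  # ['legacy_banner']
print(unread_env_vars({"DB_URL", "OLD_CDN_HOST"}, {"DB_URL"}))  # ['OLD_CDN_HOST']
```

The same set-difference pattern extends to the other debris classes in the list, such as schema columns versus columns referenced in queries, or declared dependencies versus imported modules.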

Second, long-duration canaries would replace point-in-time tests. A canary that runs for fifteen minutes and serves a hundred requests does not exercise the cocoa-debris mechanism. A canary that runs for a week and serves a million requests, against a production-aged dataset, with the full historical sediment of the cache and the database visible to it, does. The expensive long-canary is the integration-rig test that the tribology literature demands; the cheap point-in-time canary is the pin-on-disc test that systematically under-predicts.

Third, post-mortems would include morphology analysis in addition to root-cause analysis. The question is not only “what caused this incident” but also “what does the shape of this failure tell us about the regime that produced it.” The same way a metallurgist reads a wear scar to infer mechanism, a site reliability engineer could read an incident timeline to infer whether the dominant pattern is adhesive (unexpected coupling between modules), abrasive (debris from old changes), fatigue (concentrated stress at scar sites), or tribochemical (defensive instrumentation gone wrong). Read this way, many production incidents look mostly tribochemical and partly abrasive: the failure runs through the layers of monitoring, retry, and circuit-breaker code that were supposed to protect the system, and the trigger is some piece of accumulated state nobody owns.
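One way to make morphology analysis routine is to tag each contributing factor in a post-mortem with one of the four modes and tally the result. The sketch below assumes a hypothetical tagging scheme applied by hand during review; the factor descriptions are invented examples.

```python
from collections import Counter
from enum import Enum


class WearMode(Enum):
    ADHESIVE = "unexpected coupling between modules"
    ABRASIVE = "debris from old changes"
    FATIGUE = "concentrated stress at scar sites"
    TRIBOCHEMICAL = "defensive instrumentation gone wrong"


def morphology(factors: list[tuple[str, WearMode]]) -> list[tuple[WearMode, int]]:
    """Tally tagged contributing factors, most common mode first."""
    return Counter(mode for _, mode in factors).most_common()


# Hypothetical incident, tagged by hand during the review.
incident = [
    ("retry storm against the primary database", WearMode.TRIBOCHEMICAL),
    ("circuit breaker stuck open for one region", WearMode.TRIBOCHEMICAL),
    ("orphan rows from an abandoned migration", WearMode.ABRASIVE),
]
for mode, count in morphology(incident):
    print(f"{mode.name}: {count}")
```

Aggregating these tallies over a quarter of incidents, rather than one, is what would show which mechanism dominates the regime.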

Fourth, system-level wear coefficients would replace component-level pass rates. “We have 98% test coverage and zero failed unit tests” is a pin-on-disc number. The system-level number would be something like “incidents per ten thousand deploys traceable to multi-event interactions older than two weeks.” That number, if measured, would be the wear coefficient of the CI/CD regime itself, and it is the only number that tells the team whether the rate of debris accumulation is sustainable or not.
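The proposed metric can be sketched as follows. The record shape is an assumption: each incident is reduced to the ages, in days, of the changes found to have contributed to it, and “multi-event” is read as two or more contributing changes older than the cutoff. Other readings of the definition are possible.

```python
def wear_coefficient(incident_change_ages: list[list[int]], deploys: int,
                     min_age_days: int = 14) -> float:
    """Incidents per 10,000 deploys that involve at least two
    contributing changes older than min_age_days."""
    if deploys <= 0:
        raise ValueError("deploys must be positive")
    aged_incidents = sum(
        1 for ages in incident_change_ages
        if sum(age > min_age_days for age in ages) >= 2
    )
    return aged_incidents * 10_000 / deploys


# Hypothetical quarter: 3 incidents across 4,000 deploys.
# Only the first incident has two contributing changes older than 14 days.
incidents = [[21, 35, 1], [0], [16, 2]]
print(wear_coefficient(incidents, deploys=4000))  # 2.5
```

The hard part is not the arithmetic but the input: attributing an incident to changes weeks old requires the post-mortem to look past the most recent deploy.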

The Larger Pattern

The deeper observation underneath the fretting analogy is structural. Whenever a system has many small things touching each other under load with constant low-amplitude perturbation, the canonical failure mode is not in any individual thing. It is in the coupling, the trapped debris, and the third-order interaction. This is true of bolted aircraft skins, of dental implant-abutment interfaces, of press-fit shafts in turbomachinery, and of every piece of high-velocity software infrastructure currently running in production. The tribology literature has been writing this down since the 1960s. The software industry’s institutional memory of the same lesson is mostly post-mortems that nobody reads twice and a folk wisdom that “complex systems fail in complex ways,” which is true but not actionable.

The actionable version is the tribologist’s version: identify the regime, instrument the right contact patch, audit the debris, read the morphology, and stop trusting the clean-rig test to tell you what the dirty system is doing. The fretting failure is the one your CI/CD pipeline is generating right now, on the components you are most certain are well-tested. The reason you have not seen it yet is not that it isn’t happening. It is that it leaves the same kind of trace fretting wear leaves on steel — a quiet brown stain at the contact patch, building cycle by cycle below the threshold of any single inspection, until the day the panel comes off.

Sources: Hutchings & Shipway, Tribology: Friction and Wear of Engineering Materials, 2nd ed. (Butterworth-Heinemann, 2017), Chapter 8; ASM Handbook Volume 18, “Friction, Lubrication, and Wear Technology”; Rabinowicz, Friction and Wear of Materials, 2nd ed. (Wiley, 1995) for the four canonical wear mode taxonomy.

Read the Wear Scar, Not Just the Last 24 Hours

The essay’s third practice change — post-mortem morphology analysis — is a provenance problem more than an instrumentation one. To read the shape of a failure across the system’s historical sediment, you need an append-only signed record of what happened: every deploy, every flag toggle, every schema migration, every retry, every monitor firing, every dependency bump, sliceable by component and by time. That is what Chain of Consciousness is for. Run morphology queries across the chain the same way a tribologist runs SEM analysis across a wear scar. The substrate stops being “the last 24 hours of logs” and starts being the full historical record of the contact patch.

pip install chain-of-consciousness
npm install chain-of-consciousness

Hosted Chain of Consciousness ships the audit-trail substrate as a service. The cocoa stops being invisible.