
Is Your System on a Retrograde Bed?

Marine ice-sheet instability is a precise model of metastable failure: the outage that keeps going after you've removed what caused it. Here's how to tell whether your system sits on a slope that can't self-arrest.

Published June 2026 · 11 min read

Somewhere under West Antarctica, a glacier the size of Florida is retreating, and the unsettling part is this: even if we stopped the warming tomorrow, it might not stop.

Thwaites Glacier, the one the press nicknamed the “Doomsday Glacier,” drains about a tenth of the West Antarctic Ice Sheet and holds something like 65 centimeters of sea-level rise in its catchment. In 2014, a team led by Eric Rignot tracked its grounding line (the line where the glacier lifts off the bedrock and begins to float) using two decades of satellite radar, and the conclusion made headlines: “We have passed the point of no return” (Rignot et al., Geophysical Research Letters, 2014). The reason was not that the warming was unstoppable. It was the shape of the ground. Thwaites was losing its grip on a ridge only 10 to 20 kilometers wide, and behind that ridge the bedrock sloped downhill, inland, into an over-deepened bowl. Once the glacier backed off the ridge, modelers calculated it would have to retreat 300 to 400 kilometers before it found ground high enough to stop, because nothing in between rises far enough to catch it.

Stop the trigger, and it keeps going anyway. If that sentence makes your on-call instincts twitch, good. You have seen this failure. It just wasn't made of ice.

The bed decides everything

The physics here is old and clean. In 1974, the glaciologist Johannes Weertman worked out the stability of a marine ice sheet (one whose base sits below sea level) and found that its fate is governed almost entirely by one thing: the slope of the bed at the grounding line (Weertman, Journal of Glaciology, 1974). If the bed slopes up toward the interior (a “prograde” bed), the ice sheet is stable, even unconditionally so: push the grounding line back and it wants to return. But if the bed slopes down toward the interior, a “retrograde” bed, which is exactly what underlies much of West Antarctica, the configuration is inherently unstable.

Here is why. When a grounding line on a retrograde bed retreats, it moves into deeper water, which means it sits under thicker ice. And the rate at which ice discharges across the grounding line depends very steeply on the thickness there. Christian Schoof made this rigorous in a 2007 paper with a title that tells you the whole story: “Ice sheet grounding line dynamics: Steady states, stability, and hysteresis” (Schoof, Journal of Geophysical Research: Earth Surface, 2007). He proved that no stable grounding-line position is possible on a retrograde slope, and that the flux of ice rises as roughly the fourth or fifth power of the grounding-line thickness. Sit with that exponent for a moment. A small retreat into slightly thicker ice produces a large jump in discharge, which thins the ice, which pushes the grounding line further back into still-thicker ice, which raises the flux again. Each increment of retreat makes the next increment easier. The feedback gain is greater than one.
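
To put a number on that exponent, here is a minimal Python sketch. It assumes only that flux scales as thickness raised to a fixed power, with 4 and 5 bracketing the range quoted above. The name `flux_ratio` is made up for this illustration and comes from no glaciology library.

```python
def flux_ratio(thickness_ratio: float, exponent: float) -> float:
    """Relative change in grounding-line flux for a relative change in
    thickness, assuming flux is proportional to thickness ** exponent."""
    return thickness_ratio ** exponent


# Hypothetical perturbation: the grounding line retreats into ice 5% thicker.
for exponent in (4.0, 5.0):
    ratio = flux_ratio(1.05, exponent)
    print(f"exponent {exponent:.0f}: 5% thicker ice -> {ratio:.2f}x the flux")
# exponent 4: 5% thicker ice -> 1.22x the flux
# exponent 5: 5% thicker ice -> 1.28x the flux
```

Five percent more thickness buys roughly 22 to 28 percent more discharge. That disproportion is what lets each step of retreat pay for the next.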

And the killer property, the one Schoof put in his title, is hysteresis. Once it runs away, you cannot reverse it by restoring the old conditions. Cool the ocean back to where it was when the retreat began and the glacier keeps going, because the thing driving it now is not the temperature. It is the geometry it has retreated into. The trigger has left the building; the bed is running the show.

Every degradation loop has a bed slope

Now translate. A glacier's bed slope is just the gain of its degradation feedback loop, the answer to a single question you can ask of any failure mode in any system: does each increment of failure make the next increment easier, or harder?

If easier, you are on a retrograde bed. Gain greater than one. The failure feeds itself. If harder, you are on a prograde bed, the system leans back toward health on its own, and the failure self-arrests. That distinction, not the size of the initial trigger, is what determines whether a hiccup becomes an outage you can wait out or an outage you have to fight your way out of.
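
The distinction reduces to a single multiplier. As a deliberately crude sketch (a constant `gain` is an assumption made for illustration; in real loops the gain moves with load), here is what the two slopes do to the same initial hiccup:

```python
from __future__ import annotations


def propagate(initial_failures: float, gain: float, rounds: int) -> list[float]:
    """Each round, every unit of failure causes `gain` units of failure
    in the next round. Returns the failure count per round."""
    history = [initial_failures]
    for _ in range(rounds):
        history.append(history[-1] * gain)
    return history


prograde = propagate(100.0, gain=0.5, rounds=5)    # each wave is smaller
retrograde = propagate(100.0, gain=1.5, rounds=5)  # each wave is larger

print(prograde[-1])    # 3.125
print(retrograde[-1])  # 759.375
```

Same trigger, same size, opposite fates. Nothing about the initial 100 failures tells you which one you will get; only the gain does.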

Distributed-systems researchers found their way to this exact picture from the other direction, and gave it a name: metastable failure. In a 2021 paper, Nathan Bronson and colleagues defined it precisely (Bronson et al., “Metastable Failures in Distributed Systems,” HotOS 2021): a trigger pushes the system into a bad state that persists even after the trigger is removed, in which useful throughput collapses and a “sustaining effect” (work amplification, lost efficiency) holds it there. Read their definition next to Schoof's abstract and you are reading two descriptions of the same animal. The trigger leaves; the bad geometry remains; the system will not climb back out on its own.

The canonical example is the retry storm, and it is a retrograde bed in pure form. A transient blip causes some requests to fail. Clients retry, which is sensible for a transient blip. But the retries are additional load, which slows the server, which makes more requests time out, which triggers more retries, which add more load. Each wave of failure makes the next wave larger. Schoof's rising ice flux and the retry storm's rising request volume are the same curve with different axes. So are the others on the list every senior engineer carries: the garbage-collection death spiral (GC pressure leaves less time for useful work, which backs up allocations, which raises GC pressure); connection-pool exhaustion (slow responses hold connections longer, draining the pool, making responses slower); the cache stampede (one expired key sends every client to the origin at once, which slows the origin, so refills take longer and more keys expire before they can be repopulated). And the patriarch of them all: the 1986 Internet congestion collapse, when load rose, useful throughput fell toward zero, and retransmissions sustained the meltdown until Van Jacobson's congestion control (SIGCOMM, 1988) added the back-pressure that re-shaped the bed.
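
A toy model makes the retry storm's persistence visible. The sketch below is not drawn from any real system and rests on three stated assumptions: a fixed stream of fresh requests per tick, a server whose goodput falls as capacity squared over offered load once it is overloaded (it wastes work on requests that time out anyway), and a fixed fraction of failures returning as retries on the next tick. All names and numbers are hypothetical.

```python
def simulate(capacities, new_load=80.0, retry_fraction=0.9):
    """Toy retry-storm model; returns goodput (successful requests) per tick.

    Assumptions (illustrative only):
    - `new_load` fresh requests arrive every tick.
    - Past capacity, goodput falls as capacity**2 / offered rather than
      plateauing, because the server burns time on doomed requests.
    - `retry_fraction` of each tick's failures come back as retries.
    """
    retries = 0.0
    goodputs = []
    for capacity in capacities:
        offered = new_load + retries
        goodput = offered if offered <= capacity else capacity**2 / offered
        retries = retry_fraction * (offered - goodput)
        goodputs.append(goodput)
    return goodputs


# 10 healthy ticks, a 5-tick capacity blip (the trigger), then 50 ticks
# with capacity fully restored.
capacities = [100.0] * 10 + [30.0] * 5 + [100.0] * 50

with_retries = simulate(capacities, retry_fraction=0.9)
no_retries = simulate(capacities, retry_fraction=0.0)

print(f"goodput before the blip:            {with_retries[9]:.1f}")
print(f"50 ticks after, aggressive retries: {with_retries[-1]:.1f}")
print(f"50 ticks after, no retries:         {no_retries[-1]:.1f}")
```

With retries switched off, goodput snaps back the moment capacity returns. With aggressive retries it stays pinned at a fraction of its pre-blip level long after the blip is gone, even though fresh load never exceeded 80 percent of capacity.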

A sibling idea is worth naming so we can set it aside. Some outages are irreversible because they mutate state: they corrupt data or leak a secret, and the damage persists because something got written that can't be unwritten. That is a real and separate axis, the question “will the damage persist?” The retrograde bed is the other axis: “does each increment make the next easier?” The first is about residual state; the second is about feedback gain. A system can be on a retrograde bed with no corrupted byte anywhere; the retry storm leaves your data pristine and your service dead.

The root cause is the bed, not the trigger

Here is where the two fields, having walked in from opposite sides, shake hands on the same hard-won lesson, and it is the most useful thing in this essay.

Bronson's team states it flatly: “the root cause of a metastable failure is the sustaining feedback loop, rather than the trigger. Many triggers can lead to the same failure state, so addressing the sustaining effect is much more likely to prevent future outages.” A glaciologist would nod. The instability of Thwaites is not the warm-water pulse that started this retreat; it is the retrograde bed that will convert any sufficient pulse into a runaway. Chase the trigger and you are playing whack-a-mole with an infinite supply of moles. Fix the bed and you are done.

This is why the universal incident-response instinct, “find what caused it, stop that, and it'll recover,” quietly fails on retrograde-bed systems, and fails in a way that burns precious minutes during an outage. You roll back the deploy that triggered the retry storm, and the storm keeps roaring, because the deploy was the warm-water pulse, not the bed. The retries are now sustaining themselves on the load they themselves create. You removed the forcing. The forcing was never the problem.

The number capacity planning gets wrong

If you take one operational metric from glaciology, take this one: the recovery threshold is lower than the failure threshold, and usually far lower.

Because of hysteresis, you do not climb back out of a metastable failure by easing load down to just under the point where it broke. Bronson's group is explicit: the system “remains in the failure state until a big enough corrective action is applied,” so load has to drop well below the level that originally tripped it. There are, in effect, two thresholds: the tripping threshold, where gain crosses one and the system falls in, and the lower re-grounding threshold, where a big enough corrective finally lets it climb out. The gap between them is the hysteresis.

Almost everyone measures the wrong one. We load-test to find where the system breaks and we set our alarms and autoscaling just under it. But that tripping threshold tells you nothing about how to recover. During a real metastable failure, shedding load back to 95 percent of the breaking point does nothing; you may have to shed to 50 percent, or drain to near zero and let it cold-start, before goodput returns. If you have never measured your re-grounding threshold, you will discover it live, at 3 a.m., by trial and error, while the graph stays flat on the floor. The actionable move is to measure it on purpose, in advance: push the system into the failure, then find how far down you have to back off before it actually recovers. That distance is your hysteresis, and it is the number your runbook needs.
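
Here is what measuring it on purpose might look like against a toy model. The sketch assumes a hypothetical overloaded server whose goodput falls as capacity squared over offered load, with 90 percent of failures retried on the next tick; every name and constant is illustrative. It ramps load up until the system trips, then backs load off from the failed state until goodput returns.

```python
def settle(new_load, retries, capacity=100.0, retry_fraction=0.9, ticks=500):
    """Hold load constant for `ticks` ticks of the toy model.
    Returns (goodput, retries) at the end of the run."""
    goodput = 0.0
    for _ in range(ticks):
        offered = new_load + retries
        goodput = offered if offered <= capacity else capacity**2 / offered
        retries = retry_fraction * (offered - goodput)
    return goodput, retries


def healthy(new_load, goodput):
    """Healthy means nearly all fresh requests are being served."""
    return goodput >= 0.99 * new_load


# Ramp up from a healthy start to find the tripping threshold.
retries, trip = 0.0, None
for load in range(50, 151, 5):
    goodput, retries = settle(load, retries)
    if not healthy(load, goodput):
        trip = load
        break

# From the failed state, back off until the system re-grounds.
recover = None
for load in range(trip, 0, -5):
    goodput, retries = settle(load, retries)
    if healthy(load, goodput):
        recover = load
        break

print(f"trips at load {trip}, recovers at load {recover}")
print(f"hysteresis gap: {trip - recover}")
```

In this toy model the recovery load lands far below the trip load, and every load level in between is one where the system is fine if it was already fine and dead if it was already dead. A real measurement would swap `settle` for an actual load test, but the two-sweep shape is the point.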

Super-linear gain means there is no gradual middle

There is a reason these failures feel like falling off a cliff rather than sliding down a hill. Schoof's flux law is a power law, fourth or fifth power, so a small change in thickness at the grounding line produces a disproportionately large change in discharge. There is no gentle regime. The system is fine, and then, across a narrow band, it is catastrophic, with almost nothing in between.

Retry amplification has the same shape. If each failed request spawns a few retries and each retry can itself fail and fan out, the load multiplies geometrically, not additively. Super-linear gain is precisely what removes the comfortable middle ground where you would have time to notice a slow degradation and respond. If your degradation feedback is multiplicative (fan-out, exponential client behavior, anything raised to a power) then “gradual” is not a state the physics will let you occupy. You get fine, then a knee, then the floor. Plan for the knee, because you will not get to manage the slope.
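
The geometric growth is easy to check by hand. Assume a call chain in which every layer independently retries a failing downstream call; the function name and the three-layer, three-retry example are made up for illustration.

```python
def worst_case_attempts(retries_per_layer: int, layers: int) -> int:
    """Attempts that reach the deepest service for one user request when
    every layer makes 1 initial try plus `retries_per_layer` retries and
    every attempt fails."""
    return (1 + retries_per_layer) ** layers


def additive_guess(retries_per_layer: int, layers: int) -> int:
    """What you would expect if retries merely added up across layers."""
    return 1 + retries_per_layer * layers


print(worst_case_attempts(3, 3))  # 64
print(additive_guess(3, 3))       # 10
```

Three polite retries at each of three layers can put 64 requests on the bottom service for every one the user sent, which is why the knee arrives so much sooner than intuition suggests.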

You can't move the trigger, but you can re-shape the bed

None of this is doom, and the most encouraging line in the whole subject is also Weertman's: a prograde bed is unconditionally stable. The shape of the ground is not always something you can choose, but in engineered systems, remarkably often, it is. You usually cannot stop triggers from arriving (hardware fails, traffic spikes, a dependency blips). What you can do is engineer the bed so that each increment of failure makes the next one harder instead of easier, turning gain-greater-than-one into gain-less-than-one.

That is exactly what the standard resilience toolkit does, and the bed-slope lens tells you what each tool is for. A retry budget caps total retries so a storm can't feed itself; it flattens the amplification curve, the retry storm's counterpart to Schoof's power law. Exponential backoff with jitter spreads the retry load out in time so it can't pile into a self-sustaining wave. A circuit breaker is a manufactured pinning point: when errors cross a line, it trips and stops sending load into the failure, forcing the system back toward the stable side before it slides into the bowl. Load-shedding that bites early, before the grounding line, while gain is still climbing toward one, sheds a little to save the whole. Every one of these is a way of building a prograde bed: making the system want to return to health rather than accelerate away from it.
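
Two of those tools fit in a few lines. The sketch below is one plausible shape, not a reference implementation: `backoff_delay` uses the "full jitter" variant of exponential backoff, and `RetryBudget` is a hypothetical class that permits retries only while they stay under a fixed ratio of first-attempt requests. The names, defaults, and the lifetime (rather than windowed) counting are all simplifying assumptions.

```python
import random


def backoff_delay(attempt, base=0.1, cap=10.0, rng=random):
    """Full-jitter exponential backoff: a delay drawn uniformly from
    [0, min(cap, base * 2**attempt)] seconds."""
    return rng.uniform(0.0, min(cap, base * 2 ** attempt))


class RetryBudget:
    """Allow retries only while retries <= ratio * first attempts.

    Simplification: counts over the object's lifetime; a production
    version would likely use a sliding window or token bucket."""

    def __init__(self, ratio=0.1):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def can_retry(self):
        """Spend one unit of budget if available; return whether allowed."""
        if self.retries + 1 > self.ratio * self.requests:
            return False
        self.retries += 1
        return True


budget = RetryBudget(ratio=0.25)
for _ in range(100):
    budget.record_request()

# Even if every one of the 100 requests fails and wants a retry,
# only the budgeted share gets through.
allowed = sum(budget.can_retry() for _ in range(100))
print(allowed)  # 25
```

The budget is what turns the gain below one: however many requests fail, retries can add at most a fixed fraction on top of fresh load, so a wave of failure cannot grow the next wave without limit.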

And here is the genuinely hopeful disanalogy, the place the metaphor breaks in your favor. Thwaites, on human timescales, has essentially no re-grounding lever; the corrective needed is larger than anything we can apply. Your service does have one. A metastable failure recovers the instant a big enough corrective lands: a load-shed, a flush, a cold restart. The whole reason to study the glacier is that it shows you, in slow and merciless detail, the failure you actually can prevent and recover from, if you build the lever in before you need it.

The glaciologist's checklist

So run the survey the way a glaciologist would, on each of your degradation loops, and do it before anything is moving, because you cannot characterize a grounding line in mid-retreat any more than you can measure a system that is already on the floor.

  1. Find the grounding line. Identify the threshold where the feedback gain crosses one: the load, error rate, or queue depth past which the loop starts feeding itself. That is where your system tips from prograde to retrograde.
  2. Measure the bed geometry. Past that point, does the gain actually exceed one, and by how much? Look at the real curves: retry fan-out per failure, the GC-pressure-to-useful-work ratio, connection-hold time under slow responses. If failure begets more failure, the bed is retrograde.
  3. Measure the hysteresis. Find the re-grounding threshold, how far below the tripping point you must back off to actually recover. Put that number in the runbook, because it is not the number your load test gave you.
  4. Install a pinning point. If the bed is retrograde, build the re-grounding mechanism in advance: a circuit breaker, a retry budget, a shed-to-stable mode, a deliberate prograde ridge the retreat will catch on before it reaches the bowl.
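
The pinning point in step 4 can be as small as a circuit breaker. What follows is a minimal sketch under stated assumptions: a consecutive-failure threshold, a fixed cool-down, and an injectable clock so it can be tested. The class and parameter names are hypothetical, not any library's API.

```python
import time


class CircuitBreaker:
    """Stop sending load into a failing dependency once failures pile up."""

    def __init__(self, failure_threshold=5, reset_after=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # cool-down in seconds
        self.clock = clock
        self.failures = 0               # consecutive failures seen
        self.opened_at = None           # None means the breaker is closed

    def allow(self):
        """Return True if a call may be attempted right now."""
        if self.opened_at is None:
            return True
        # Open: refuse until the cool-down has elapsed, then let probe
        # traffic through. A success closes the breaker; one more
        # failure re-opens it and restarts the cool-down.
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


# Usage with a fake clock so the example runs instantly.
now = [0.0]
breaker = CircuitBreaker(failure_threshold=3, reset_after=10.0,
                         clock=lambda: now[0])
for _ in range(3):
    breaker.record_failure()
print(breaker.allow())  # False: tripped, load stops flowing into the failure
now[0] = 10.0
print(breaker.allow())  # True: cool-down elapsed, a probe may go through
```

The threshold is the ridge: set it below the point where your loop's gain crosses one, and the retreat catches there instead of running to the bowl.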

Two fields that never cite each other, cryosphere science and distributed systems, independently discovered the same truth: in a self-amplifying failure, the trigger is a distraction and the geometry is destiny. The glaciologists just can't fix their bed. You can. The question to carry back to your own systems is not “what could trigger an outage?” That list is endless and mostly out of your hands. It is the one Weertman would ask: which way does the bed slope? Find that out while the ice is still grounded.


Sources

  1. J. Weertman, “Stability of the junction of an ice sheet and an ice shelf,” Journal of Glaciology (1974) — a retrograde (interior-deepening) marine bed is inherently unstable; a prograde bed is unconditionally stable.
  2. C. Schoof, “Ice sheet grounding line dynamics: Steady states, stability, and hysteresis,” Journal of Geophysical Research: Earth Surface (2007), doi:10.1029/2006JF000664 — no stable grounding line on a reverse slope; grounding-line flux scales as roughly the fourth-to-fifth power of thickness; hysteresis.
  3. E. Rignot et al., “Widespread, rapid grounding line retreat of Pine Island, Thwaites, Smith, and Kohler glaciers, West Antarctica, from 1992 to 2011,” Geophysical Research Letters (2014), doi:10.1002/2014GL060140 — “point of no return”; Thwaites grip on a 10–20 km ridge, 300–400 km of retreat before higher ground.
  4. N. Bronson, A. Aghayev, A. Charapko, T. Zhu, “Metastable Failures in Distributed Systems,” HotOS 2021 — trigger vs. sustaining effect; failure persists after the trigger is removed; recovery needs a corrective larger than the trip.
  5. V. Jacobson, “Congestion Avoidance and Control,” SIGCOMM 1988 — the back-pressure response to the 1986 Internet congestion collapse.

An agent fleet can sit on a retrograde bed too. Build the pinning point before the retreat.

When one agent's bad output becomes another agent's trusted input, failure can feed itself: a wrong result gets cited, amplified, and acted on, and each step makes the next error easier. That's a retrograde bed with no corrupted byte in sight. The Agent Trust Stack is a pinning point for it: provenance so a claim's origin is checkable, reputation so an unreliable source loses weight instead of compounding, and verification that trips before a bad result propagates downstream — a prograde ridge that makes each failure harder, not easier.
