
The South Atlantic Anomaly of Production Systems

Earth’s magnetic field is dying in one specific place. Production systems fail the same way — locally, non-uniformly, and with the recovery mechanism participating in the failure. A century of geomagnetism has already shipped the monitoring playbook.

April 2026 · 12 min read

Earth’s magnetic field is dying in one specific place. Between 2014 and 2025, ESA’s Swarm satellite constellation watched a weak spot over the South Atlantic expand by an area “nearly half the size of continental Europe” — roughly two million square miles of degraded field. The rest of the field is mostly fine. Compasses still work in Toronto and Tokyo. Aurorae still arc over the poles. The field is failing locally, in one region, while operating normally everywhere else.

This is the South Atlantic Anomaly, and once you see it, you start seeing it in your production systems too.

The shape of a localized failure

The anomaly is not a single hole. Chris Finlay of DTU Space, lead author on a paper in Physics of the Earth and Planetary Interiors in October 2025, told ESA this directly: “The South Atlantic Anomaly is not just a single block. It’s changing differently towards Africa than it is near South America.” Since 2020, a sub-region southwest of Africa has been weakening faster than the rest of the anomaly. “There’s something special happening in this region,” Finlay said — and “something special” in geomagnetism means a reverse-flux patch, an area at the core-mantle boundary where field lines drive back into the core rather than emerging from it, locally canceling the main dipole.

The non-uniform expansion is the killer detail. The anomaly grows differently in different directions. While the South Atlantic loses field strength, Siberia’s strong-field region gained 0.42% of Earth’s surface area — comparable to Greenland — between 2014 and 2025. The strong-field region over Canada shrank by 0.65%, an area roughly the size of India. The equations governing the field did not change. The spatial distribution reorganized.

Up in low Earth orbit, this matters operationally. Satellites passing through the SAA receive elevated radiation doses; the Hubble Space Telescope, the ISS, and dozens of science missions disable sensitive instruments during SAA passes. The anomaly does not exist only as a number on a chart — it forces operational accommodations on every spacecraft that flies through it.

Production systems fail this way too

On November 25, 2020, AWS Kinesis Data Streams in us-east-1 began returning elevated API errors at 6:36 AM PST. By 7:30 AM, the failure had cascaded into CloudWatch, Cognito, IoT Core, and EventBridge — services that depended on Kinesis for telemetry or event routing. Hours later, 1Password, Coinbase, Adobe Spark, Roku, and The Washington Post were degraded. The shape: one service in one region degrades; dependent services in the same region degrade; customer-facing applications fail globally.

In October 2025, a single DNS resolution error in DynamoDB took out Snapchat, Venmo, Canva, Fortnite, Roblox, Reddit, Disney+, and Amazon Alexa for over fifteen hours. As CNN reported it: “That single bottleneck created five more failures, which created twenty-five more.” The geometric multiplication — 1, 5, 25 — is the production-system equivalent of reverse-flux patches at the core-mantle boundary. Each failure point locally inverts the field. Healthy dependencies become failure sources. The inversions propagate outward.

The Facebook outage of October 4, 2021 is the cleanest case study. Cloudflare, watching from outside the perimeter, reconstructed the timeline in a public post-mortem. At 15:40 UTC, a configuration change intended to assess backbone capacity unintentionally took down all connections in Facebook’s backbone network. At 15:51 UTC, Cloudflare’s resolvers began returning SERVFAIL on Facebook DNS lookups. By 15:58 UTC — eighteen minutes from the original change — Facebook had stopped announcing routes to its DNS prefixes and disappeared from the internet.

Here is where the morphology gets uncomfortable. Facebook’s DNS servers were designed to withdraw their BGP advertisements when they could not reach the data centers behind them — a health check, the correct behavior; you do not want to advertise routes to dead servers. But when the backbone fell, every DNS server triggered the same check at the same time and voluntarily disconnected. Cloudflare’s engineers wrote it plainly: “It was as if someone had ‘pulled the cables’ from their data centers all at once.” The monitoring system was the accelerant. The protection mechanism was the failure mechanism.

The pattern shows up wherever recovery loops compound. Square’s Redis tier became briefly unavailable in March 2017; a retry loop attempted up to 500 consecutive retries without backoff, and the recovery mechanism “effectively DOSed the service.” In each case the same five steps fire: a globally distributed system operates normally; degradation concentrates in one location; the anomaly grows non-uniformly; compensatory mechanisms either mask or amplify the problem; the failure either resolves or generalizes. That sequence is the SAA’s morphology written in YAML.
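
A minimal sketch of the retry-pressure contrast, assuming a generic callable and illustrative parameters (the 500-attempt loop mirrors the shape of the Square incident; the backoff numbers are not from any real post-mortem):

    import random
    import time

    def naive_retry(call, max_attempts=500):
        """The Square failure shape: a tight loop that turns one slow
        dependency into hundreds of extra requests per client."""
        for _ in range(max_attempts):
            try:
                return call()
            except ConnectionError:
                continue  # immediate retry; recovery pressure compounds
        raise RuntimeError("retries exhausted")

    def bounded_retry(call, max_attempts=4, base=0.05, cap=2.0):
        """Exponential backoff with full jitter and a small retry budget,
        so recovery pressure decays instead of compounding."""
        for attempt in range(max_attempts):
            try:
                return call()
            except ConnectionError:
                if attempt == max_attempts - 1:
                    raise
                # full jitter: sleep a random amount up to the capped exponential delay
                time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))

The budget matters as much as the backoff: four attempts caps the worst-case amplification at 4x per client, where the naive loop allows 500x.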

The monitoring playbook geomagnetism already shipped

Geomagnetism has been doing this longer than we have. At every point on Earth’s surface, three scalar quantities describe the local field: declination D (how far magnetic north deviates from true north), inclination I (the dip below horizontal), and total intensity F (the scalar field magnitude, typically 25,000–65,000 nanotesla at the surface). These three numbers do not require a model of the underlying dynamo. You measure them where you are, and compare to a reference.

The mapping is direct enough to be useful:

Geomagnetic | Production system | What it catches
Declination D (deviation from true north) | Drift from baseline behavior (distance from the model’s prediction) | The system is “pointing” somewhere different than the model expects
Total intensity F (scalar field strength) | Aggregate throughput / capacity utilization | Weakening: the system has less capacity than it should
Drift speed dD/dt, dF/dt (rate of change) | Rate of change, not just the current value | Acceleration: the degradation is speeding up, not just present
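
A minimal sketch of the comparison, assuming a hand-rolled per-endpoint reference rather than any real monitoring API (all names and tolerances here are illustrative):

    from dataclasses import dataclass

    @dataclass
    class Reference:
        """The local model: what 'normal' means at this endpoint."""
        latency_ms: float
        error_rate: float
        throughput_rps: float
        tolerance: float  # fractional deviation considered normal here

    def deviation(reading_ms, errors, rps, ref):
        """Interpret a reading relative to the local reference, WMM-style:
        the same absolute number can be normal at one endpoint and
        anomalous at another."""
        return {
            "latency": (reading_ms - ref.latency_ms) / ref.latency_ms,
            "errors": errors - ref.error_rate,
            "throughput": (rps - ref.throughput_rps) / ref.throughput_rps,
        }

    checkout = Reference(latency_ms=50.0, error_rate=0.001, throughput_rps=800.0, tolerance=0.25)
    search = Reference(latency_ms=400.0, error_rate=0.01, throughput_rps=90.0, tolerance=0.25)

    # The same 300 ms reading: a 5x anomaly for checkout, within tolerance for search.
    print(deviation(300.0, 0.001, 800.0, checkout)["latency"])  # 5.0
    print(deviation(300.0, 0.01, 90.0, search)["latency"])      # -0.25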

The reference is the World Magnetic Model. NOAA’s National Centers for Environmental Information and the British Geological Survey jointly issue a new WMM every five years. WMM2025 was released December 17, 2024, valid through late 2029, with 175 spherical-harmonic coefficients to degree and order 12. A high-resolution variant, WMMHR2025, goes to degree and order 133 — roughly 300-kilometer spatial resolution. When you open a navigation app and the compass arrow stabilizes, the underlying declination correction is computed from those coefficients.

There is a contour, called the agonic line, where declination is exactly zero — where magnetic north equals true north. On either side, declination has opposite sign. The line currently runs through the central United States and across the western Atlantic, and it migrates over time. The agonic line is what the field considers normal at that location — the boundary between “compass points east of true” and “compass points west of true.” Cross it without updating your model and your compass starts lying to you.

Production systems have agonic lines too. We call them SLOs. Like the magnetic version, they move.

The 2019 off-cycle update

The five-year WMM cadence is a deliberate compromise. Every refresh forces every downstream consumer — aviation chart providers, ICAO, NATO, smartphone APIs, airport runway numbering — to update. Too-frequent refreshes impose coordination costs across enormous user populations. Too-infrequent refreshes accumulate error.

Reality broke the model in year three.

In February 2019, NOAA published an unprecedented out-of-cycle WMM update. The reason was operational: the magnetic north pole had been drifting toward Siberia at roughly 55 kilometers per year — faster than WMM2015’s linear secular variation terms had predicted — and the model’s declination errors near the Arctic had grown beyond operational tolerance before the scheduled 2020 replacement. The triggering finding, in NOAA’s language, was “intense nonlinear core field variations following the release of WMM2015,” with the effect “geometrically amplified near magnetic dip poles.” Small changes in field components produce large changes in declination angle when the field is nearly vertical — exactly where the polar consumers most needed the model to be accurate.
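
The amplification is plain trigonometry, and worth seeing numerically. Declination is D = atan2(Y, X) on the horizontal field components, so a fixed perturbation swings the angle roughly in proportion to 1/H, where H is the horizontal field strength; near a dip pole, H is small. A sketch with round illustrative field values, not WMM output:

    import math

    def declination_shift_deg(X, Y, dY):
        """Change in declination D = atan2(Y, X), in degrees, for a small
        east-component perturbation dY (all values in nanotesla)."""
        return math.degrees(math.atan2(Y + dY, X) - math.atan2(Y, X))

    # Mid-latitude: strong horizontal field, H ~ 20,000 nT.
    print(declination_shift_deg(X=20_000, Y=0, dY=100))  # ~0.29 degrees

    # Near a dip pole: the field is nearly vertical, H ~ 500 nT.
    print(declination_shift_deg(X=500, Y=0, dY=100))     # ~11.3 degrees

The same 100 nT wobble that a mid-latitude compass cannot see swings the polar needle by eleven degrees.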

The 2019 update is, in software terms, an emergency hotfix release of a system used by NATO, the FAA, ICAO, both major mobile OS vendors, and Denver International Airport — which the NCEI release noted had seen its declination shift “just over 2.5 degrees over the past 22 years.” The same release introduced a new concept: “blackout zones” near the magnetic dip poles where WMM declination is inaccurate and compasses cannot be trusted at all. Geomagnetism’s response to model failure was not to refit harder. It was to mark certain regions degraded — to admit that for some inputs, the model has no truthful answer.

There is a translation key. Five-year cadence becomes annual SLO review. Linear secular variation becomes the assumption that current trends extrapolate. Geometric amplification near dip poles becomes the fact that systems near saturation amplify small perturbations: a 5% traffic increase on a server at 50% load is invisible; the same 5% at 95% load produces visible latency spikes. Off-cycle update becomes emergency architecture review when reality breaks the model. Blackout zones become the willingness to mark endpoints as untrustworthy — to tell the dashboard “we cannot interpret these metrics here.”
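
The load version of that amplification has a standard queueing form. For an M/M/1 server with service rate mu and arrival rate lambda, mean time in system is W = 1/(mu - lambda), and its sensitivity grows as 1/(mu - lambda) squared: the dip-pole geometry in queueing clothes. A sketch with illustrative numbers:

    def mm1_latency_ms(service_rate, arrival_rate):
        """Mean time in an M/M/1 system: W = 1 / (mu - lambda)."""
        assert arrival_rate < service_rate, "past saturation the queue diverges"
        return 1000.0 / (service_rate - arrival_rate)

    mu = 1000.0  # requests/sec the server can service

    for load in (0.50, 0.95):
        lam = mu * load
        before = mm1_latency_ms(mu, lam)
        after = mm1_latency_ms(mu, lam * 1.05)  # the same 5% traffic bump
        print(f"{load:.0%} load: {before:.1f} ms -> {after:.1f} ms")
    # 50% load: 2.0 ms -> 2.1 ms    (invisible)
    # 95% load: 20.0 ms -> 400.0 ms (the anomaly)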

The deeper rule: the monitoring model has a validity epoch, and reality does not wait for the refresh schedule.

Where the analogy breaks

A careful reader should already be flagging objections. Earth’s geodynamo is one coupled physical system; production systems are loose ensembles of services with human operators. The SAA evolves on geological timescales; the Kinesis cascade unfolded on hourly timescales. There is also a harder objection from paleomagnetism: a 2022 PNAS paper by Nilsson and colleagues reconstructed the geomagnetic field over the past 9,000 years and found SAA-like asymmetries recur on a roughly 1,300-year cycle, with the closest ancient analog around 600 BCE. The authors predict the current SAA “will likely disappear in the next few hundred years” without a reversal — though they also note that the dipole moment is still decreasing without sign of slowing, “which is not consistent with the proposed ancient analog and highlights its limitations.”

The recurrence finding cuts both ways. It suggests the SAA may be a normal operating mode of the dynamo rather than a precursor of failure — and some production-system anomalies are likewise cyclical and self-resolving. The lesson would be: not every red on the dashboard requires intervention; some patterns are features. But geomagnetic recurrence happens because the underlying dynamo has stable convective patterns over millennia. Production systems have nothing that durable. A pattern that recurs every six months in your traffic is not the same kind of object as a 1,300-year mode in core convection.

So the analogy transfers structure, not substance. The SAA is not literally what is happening to your shard. What it offers is a vocabulary — agonic line, blackout zone, geometric amplification, off-cycle update, reverse-flux patch — and a checklist of failure modes catalogued by a community of geophysicists over a century. The vocabulary is portable. The substance is local.

The compensatory trap

One transferred concept deserves its own section: the recovery mechanism that becomes the failure mechanism.

In geomagnetism, the SAA is partly constituted by reverse-flux patches at the core-mantle boundary that locally cancel the main field. The “patch” and the “anomaly” are the same physical phenomenon viewed at different scales. The thing reducing field strength is not external to the field — it is the field’s own dynamics producing locally inverted flux.

The software pattern matches. Facebook’s DNS health check did exactly what it was designed to do; it was the accelerant. Square’s retry loop was the recovery path; it DOSed the service. The 2015 AWS DynamoDB cascade ran on retry pressure: storage servers removed themselves from service “and continued retrying requests,” eventually overwhelming the metadata service so badly that operators “had to firewall it off” to add capacity.

The structural lesson is the same in both domains. When the recovery mechanism is part of the failure mechanism, you cannot separate them at runtime. You have to see them together at design time. Health-check thresholds, retry budgets, and circuit-breaker policies are not orthogonal to your failure model — they ARE part of it. A health check that withdraws capacity from a degraded region is a feature; a health check that simultaneously withdraws capacity from every region during a backbone fault is the whole anomaly.
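
One defensive pattern falls out of seeing them together: a health check that distinguishes “I am degraded” from “everyone is degraded.” A withdrawal that would fire in every region at once is more likely a shared fault than a local one. A hedged sketch, assuming nodes can see their peers’ check results; the quorum threshold is an invented parameter, not Facebook’s actual design:

    def should_withdraw(local_check_failed, peer_states, max_correlated=0.5):
        """Withdraw this node's route advertisement only when the failure
        looks local.

        peer_states: mapping of peer name -> True if that peer's check also
        failed. If more than max_correlated of peers are failing the same
        check, treat it as a shared fault (backbone loss, bad config push)
        and fail open rather than withdrawing everywhere at once.
        """
        if not local_check_failed:
            return False
        if not peer_states:
            return True  # no visibility into peers: fall back to local judgment
        failing = sum(peer_states.values()) / len(peer_states)
        return failing <= max_correlated

    # Local fault: one region sick, peers healthy -> withdraw. Correct behavior.
    print(should_withdraw(True, {"eu": False, "us": False, "ap": False}))  # True

    # Backbone fault: every peer fails at once -> stay up, degraded but reachable.
    print(should_withdraw(True, {"eu": True, "us": True, "ap": True}))     # False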

A practical port

INTERMAGNET — the international consortium of around 150 ground-based magnetic observatories — provides one-second-cadence measurements at fixed locations, each reporting D, I, F against the WMM. Anomalies are detected as deviation from model, not as crossings of absolute thresholds. A direct port, with a code sketch after the list:

  1. Three scalar measurements per endpoint, recorded continuously — latency, error rate, throughput. Not derived metrics; the primitive, unit-bearing quantities. A reading is interpreted relative to the local model, not a global threshold (50,000 nT is normal at high latitudes, anomalous at the equator; 50ms p95 is normal for this endpoint, anomalous for that one).
  2. A reference model per endpoint, with explicit valid-from / valid-through dates baked into the threshold config. Not “p95 latency under 200ms forever” but “p95 latency under 200ms, valid 2026-Q2, expires 2026-09-30.” When the date passes, the alert grays out and forces a refresh decision.
  3. Drift speed as a first-class signal, not just current value. If declination is drifting at three degrees per decade, you do not wait until the angle exceeds chart tolerance. You watch dD/dt. The production analog: alert on the slope, not just the level.
  4. An off-cycle trigger condition documented in advance. NOAA’s WMM rule was effectively “if declination error exceeds operational tolerance before the next scheduled release, ship out-of-cycle.” The team equivalent: if drift speed for any endpoint exceeds a defined slope for two consecutive review windows, schedule an unscheduled architecture review. The trigger has to be written down before the incident, not invented during it.
  5. Blackout zones, declared and visible. When an endpoint or region is in an operating regime where the monitoring model lies, mark it that way on the dashboard. Do not silently let stale thresholds fire alerts nobody believes; that is how alert fatigue seeds the next outage.
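
Here is that port as a single sketch covering items 2 through 5. Every name and number is illustrative, and item 4’s two-consecutive-windows rule is collapsed to one window for brevity:

    from dataclasses import dataclass
    from datetime import date

    @dataclass
    class ReferenceModel:
        """Per-endpoint reference with an explicit validity epoch (item 2)."""
        p95_latency_ms: float
        valid_from: date
        valid_through: date
        max_drift_per_window: float  # item 4: the pre-written off-cycle trigger slope
        blackout: bool = False       # item 5: model declared untrustworthy here

    def evaluate(model, today, p95_now, p95_prev_window):
        if model.blackout:
            return "BLACKOUT: metrics here are uninterpretable; do not alert"
        if not (model.valid_from <= today <= model.valid_through):
            return "EXPIRED: gray out alerts, force a model refresh decision"
        drift = p95_now - p95_prev_window  # item 3: the slope, not just the level
        if drift > model.max_drift_per_window:
            return "OFF-CYCLE: drift exceeds the pre-agreed slope, schedule review"
        if p95_now > model.p95_latency_ms:
            return "ALERT: level breach against the current epoch's threshold"
        return "OK"

    checkout = ReferenceModel(
        p95_latency_ms=200.0,
        valid_from=date(2026, 4, 1),
        valid_through=date(2026, 9, 30),
        max_drift_per_window=15.0,
    )
    print(evaluate(checkout, date(2026, 7, 1), p95_now=180.0, p95_prev_window=160.0))
    # "OFF-CYCLE: ..." -- still below the 200 ms threshold, but drifting fast

The ordering is the point: the blackout and expiry checks run before any threshold is consulted, and the drift check fires before the level check, so an endpoint can demand review while every absolute number still looks green.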

This is not a finished framework. It is a porting exercise. The geomagnetism community has been doing the equivalent for a century, with public data and an open archive. There is more to copy.

Closing

Earth’s magnetic field is dying in one specific place. It has been observable for decades. It is not getting better. Every satellite mission that crosses it plans around it. The community that watches it has a vocabulary, a model, a refresh cadence, an off-cycle trigger, and a concept of regions where the model cannot be trusted.

Production systems fail in one specific place too. The us-east-1 region. The metadata service. The broker that hit the EC2 network limit first. The DNS server whose health check fired before the rest. The failure is local before it is global. The monitoring model has a validity epoch shorter than its refresh cadence, and the recovery mechanism is part of the failure mechanism.

Watch the agonic line, not the average. The agonic line moves. Your SLO boundary should too.


Sources: ESA Swarm constellation data 2014–2025 (SAA expansion ~2 million sq mi; Siberia +0.42% area-of-Earth, Canada −0.65%). Finlay et al., Physics of the Earth and Planetary Interiors, October 2025 (reverse-flux patch finding, sub-region southwest of Africa). Nilsson et al., PNAS, 2022 (9,000-year paleomagnetic reconstruction; ~1,300-year SAA-like recurrence; closest ancient analog ~600 BCE). NOAA NCEI / British Geological Survey: WMM2025 release December 17, 2024 (175 SH coefficients, deg/order 12); WMMHR2025 (deg/order 133, ~300 km resolution); February 2019 out-of-cycle WMM update; Denver International runway declination shift “just over 2.5 degrees over the past 22 years.” INTERMAGNET ~150 ground observatories, 1-second cadence. AWS Kinesis us-east-1 outage post-mortem, November 25, 2020. AWS DynamoDB DNS resolution cascade, October 2025 (CNN coverage with 1→5→25 framing). Cloudflare blog post-mortem of Facebook outage, October 4, 2021 (15:40 / 15:51 / 15:58 UTC timeline; “pulled the cables” quote). Square engineering blog, March 2017 Redis 500-retry incident. AWS DynamoDB September 2015 metadata-service incident post-mortem.
