You cannot refactor your way out while the building is on fire. The rewrite fails not because it's wrong, but because it's sequenced wrong.
On August 6, 1997, at Macworld Boston, Steve Jobs stood in front of the Apple faithful and told them their company's savior would be Microsoft. A hundred and fifty million dollars of investment from the enemy, plus a commitment to keep shipping Office for the Mac. The crowd booed. On the giant screen behind Jobs, Bill Gates loomed over the auditorium like a conqueror accepting a surrender, and the people who loved Apple most understood the moment as humiliation.
Here is what the booing crowd had wrong, and what makes that day worth studying twenty-nine years later. Apple was roughly ninety days from running out of money. Everyone remembers the turnaround as “Think Different” and the iMac: the strategy, the vision, the rebirth. But the vision came after. First came the unglamorous triage. Take the cash, even from Redmond, because cash is oxygen. Kill the sprawling product line: dozens of overlapping Macs collapsed into a two-by-two grid of consumer and pro, desktop and portable. Stop the bleeding everywhere it bled. “Think Different” launched that September; the iMac wasn't unveiled until May 1998 and didn't ship until that August. The order was not incidental. Jobs ran the sequence that every corporate-turnaround playbook treats as non-negotiable, and that engineering teams in technical crisis violate almost every time: stabilize first, operate second, strategize third.
Your team has probably violated it. Most of us have. It's the team drowning in outages that decides the real answer is the big rewrite, and starts the rewrite while still drowning. It's “we'll fix the architecture” announced in the middle of the fire. And it fails, predictably, for a reason that isn't about discipline or talent. It fails because of the shape of the loop you're inside.
The corporate-restructuring world has the benefit of a brutal scoreboard (companies in crisis either survive or they don't), and decades of that scoreboard have produced a literature that is strikingly unanimous about sequence. Stuart Slatter and David Lovett's Corporate Turnaround, the standard text, lays out seven essential ingredients, and they are ordered: crisis stabilization first, then leadership, then stakeholder management, with strategic focus only at number four. The repositioning, the new vision, the bold pivot: all explicitly forbidden until the bleeding has stopped. In the stabilization phase, the doctrine is blunt: cash is prioritized over profit. Not growth, not margin, not the long term. Runway.
The signature instrument of this phase is something the turnaround trade calls the holy grail: the 13-week cash-flow forecast. Within the first week on the job, day three to day seven, a crisis officer builds a rough model of exactly when the company runs out of money. Which week payroll breaks. Which vendor payment bounces first. It is crude on purpose, and its function is psychological as much as financial: it converts a fog of dread into a dated deadline. You cannot prioritize triage until you know precisely how long you have.
Notice what the 13-week model is not: it is not the fix. It repairs nothing. It exists to make one quantity visible and managed: slack, the distance between you and the cliff edge. Every move that follows is going to spend it.
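To make the instrument concrete, here is a minimal sketch of a 13-week model in Python. Every figure in it (the opening balance, the weekly receipts, the biweekly payroll) is invented for illustration, and `cash_runway` is a hypothetical helper rather than any standard turnaround tool. It does only what the forecast is for: walk the balance forward and report the first week it goes negative.

```python
def cash_runway(opening_cash, receipts, disbursements):
    """Walk the cash balance forward week by week.

    Returns (shortfall_week, balances): the 1-based week in which cash first
    goes negative (None if it never does) and the closing balance per week.
    """
    if len(receipts) != len(disbursements):
        raise ValueError("receipts and disbursements must cover the same weeks")
    balance = opening_cash
    balances = []
    shortfall_week = None
    for week, (cash_in, cash_out) in enumerate(zip(receipts, disbursements), start=1):
        balance += cash_in - cash_out
        balances.append(balance)
        if shortfall_week is None and balance < 0:
            shortfall_week = week
    return shortfall_week, balances


# Illustrative figures only (assumptions, not data from any real company).
WEEKS = 13
receipts = [200_000] * WEEKS
vendors = [150_000] * WEEKS
# Payroll is assumed to land on even-numbered weeks.
payroll = [240_000 if week % 2 == 0 else 0 for week in range(1, WEEKS + 1)]
disbursements = [v + p for v, p in zip(vendors, payroll)]

shortfall_week, balances = cash_runway(500_000, receipts, disbursements)
print(f"Cash first goes negative in week {shortfall_week}")
```

With these made-up inputs the break lands in week 8, a payroll week. Nothing has been repaired, but the dread now has a date.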
Now put that next to the engineering version, because the engineering world has its own canonical description of the same disease, written by Google's Site Reliability Engineering organization, the people who professionalized keeping large systems alive.
The SRE books make a deceptively simple observation about operational labor, the manual, repetitive work they call toil: toil scales linearly with the service (more users, more servers, more tickets, more pages) while engineering scales sublinearly, because automation you build once keeps paying off. From that asymmetry they derive a warning that deserves to be read as slowly as any balance sheet: if toil is allowed to exceed about half the team's time, the team has too little engineering time left to keep up, so the toil grows further, and the team enters a death spiral where it spends ever more time on operational work.
That sentence is the reliability version of insolvency. Outages consume the engineering time that would prevent outages. Firefighting leaves no time to fix root causes, so the root causes keep firing, which leaves no time. The team trapped inside it cannot “just refactor” or “just fix the architecture,” and the reason is not weakness: it's that the spiral has already eaten the slack the fix requires. Every hour the rewrite needs is an hour the pager already owns.
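The tipping behavior can be sketched as a toy simulation. The dynamics and parameters below are assumptions chosen for illustration, not figures from the SRE books: growth adds a fixed amount of toil each week, and every engineering hour permanently retires a small slice of it. With these particular numbers the tipping point happens to sit at half of capacity; that is an artifact of the parameters, not a derivation of the 50% figure.

```python
def simulate_toil(initial_toil, weeks=52, capacity=200.0,
                  toil_growth=5.0, payoff=0.05):
    """Return weekly toil hours for a team with fixed capacity.

    Assumed dynamics (illustrative, not taken from the SRE books):
      - service growth adds `toil_growth` hours of weekly toil every week;
      - each engineering hour permanently removes `payoff` hours of weekly toil;
      - engineering time is whatever capacity toil has not already consumed.
    """
    toil = float(initial_toil)
    history = [toil]
    for _ in range(weeks):
        engineering = max(0.0, capacity - toil)
        toil = toil + toil_growth - payoff * engineering
        toil = min(capacity, max(0.0, toil))  # keep toil within [0, capacity]
        history.append(toil)
    return history


below = simulate_toil(initial_toil=90)   # starts at 45% of capacity
above = simulate_toil(initial_toil=110)  # starts at 55% of capacity
print(f"start at 45% toil -> {below[-1]:.0f} h/week after a year")
print(f"start at 55% toil -> {above[-1]:.0f} h/week after a year")
```

The two teams start twenty hours apart and end at opposite extremes: one automates its toil away, the other is owned entirely by the pager. Same people, same rules, different side of the threshold.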
Google's answer was not a pep talk. It was policy, and both halves of it are stabilization mechanics wearing engineering clothes. First, the famous 50% cap: operational work is capped at half of every SRE's time, with the rest reserved for engineering, automation, hardening, the fixes that come out of postmortems. That reserved half is deliberately protected slack, a standing runway that exists so the team can always build remedies faster than the service generates new toil. Second, the error-budget policy: when a service burns through its reliability budget, feature launches stop, automatically, by prior agreement, and effort redirects to reliability until the service is back within its objectives. That is a feature freeze with the politics pre-negotiated, a fire-stopping move that doesn't require a hero to demand it in the middle of the fire.
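Both halves of the policy reduce to arithmetic that can be checked mechanically. The sketch below assumes a request-based SLO, where the error budget is the share of requests allowed to fail (one minus the SLO target); `ServiceWindow` and `stabilization_actions` are hypothetical names, and the action strings are placeholders for whatever a team has agreed in advance.

```python
from dataclasses import dataclass


@dataclass
class ServiceWindow:
    slo_target: float      # e.g. 0.999 for "three nines"
    total_requests: int
    failed_requests: int
    toil_hours: float      # operational hours logged in the window
    team_hours: float      # total hours the team worked in the window


def error_budget_remaining(window):
    """Failures still allowed in this window before the SLO is breached."""
    allowed_failures = (1.0 - window.slo_target) * window.total_requests
    return allowed_failures - window.failed_requests


def stabilization_actions(window, toil_cap=0.5):
    """Return the pre-agreed actions the two policies trigger, if any."""
    actions = []
    if error_budget_remaining(window) <= 0:
        actions.append("freeze feature launches; reliability work only")
    if window.toil_hours / window.team_hours > toil_cap:
        actions.append("toil above cap; restore the reserved engineering time")
    return actions


# Hypothetical numbers: budget overspent and toil at 65% of team time.
burning = ServiceWindow(0.999, 1_000_000, 1_200, toil_hours=130, team_hours=200)
healthy = ServiceWindow(0.999, 1_000_000, 400, toil_hours=80, team_hours=200)
print(stabilization_actions(burning))
print(stabilization_actions(healthy))
```

The design point is that neither check asks anyone's opinion. The thresholds were negotiated before the fire, so the freeze is a lookup, not an argument.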
It would be easy to leave this as a pleasing parallel: companies bleed cash, teams bleed engineering hours, both should stabilize. But the connection is stronger than resemblance, and there's a formal result underneath it.
In 2001, Nelson Repenning and John Sterman, system-dynamics researchers at MIT, published a paper whose title alone explains half the failed improvement initiatives you've ever watched: “Nobody Ever Gets Credit for Fixing Problems That Never Happened.” In it they model what they call the capability trap. A team's effort is a finite stock, split between two flows: working (producing output, fighting today's fires) and improving (building the capability that prevents tomorrow's). When capability is low, problems multiply, which pulls effort into working, which starves improving, which lets capability decay further, which multiplies problems. The loop is reinforcing, and it has a basin: once you're inside, every locally rational hour spent firefighting deepens the trap.
The model is deliberately domain-independent. Run it with cash as the stock and it's a company in decline: losses force cuts that worsen the product that deepens the losses. Run it with engineering hours as the stock and it's the toil spiral, almost line for line. Peter Senge had named the underlying skeleton a decade earlier as the “shifting the burden” archetype: the symptomatic fix (the heroic all-nighter, the manual patch, the emergency discount) relieves pressure, which quietly removes the urgency, and eventually the ability, to apply the fundamental fix, until the organization depends permanently on its own firefighting.
Which licenses a claim that sounds extravagant and is just true: the CRO's 13-week cash model and Google's 50% toil cap are the same instrument, a control lever on the same loop. Both manufacture and defend a measured quantity of slack, because slack is the only thing that lets any effort flow back into the improving channel at all. One culture calls it runway; the other calls it error budget and engineering time. The substances differ (dollars are not engineer-hours, and the analogy shouldn't be pushed past its joint) but the feedback structure is identical, and the intervention point is identical too.
If stabilization-first is this well established, why does every generation of engineering leadership rediscover the fire the hard way? The system-dynamics work has an unnervingly good answer, and it's not stupidity.
It's called worse-before-better, and it is the trap's cruelest feature. Repenning and Sterman showed that when a trapped team starts investing in improvement, the first result is negative: the hours diverted to the fundamental fix come out of firefighting, so in the short run there are fewer hands on the fires and things visibly degrade. The cure hurts before it helps. A team with abundant slack can ride out the dip. A team already in the spiral, already starved, watches the early returns come in worse and rationally abandons the cure, retreating to the pager and calling it pragmatism. Now read “just refactor while on fire” through that lens: the rewrite is a fundamental fix attempted with zero slack; worse-before-better guarantees it degrades the present before it improves the future; the still-burning crisis eats the margin the team doesn't have; the rewrite is abandoned at the bottom of the dip, having made everything worse. The failure was not the rewrite. It was the sequencing, attempting phase three from inside phase one.
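The dip is easy to reproduce in a few lines. What follows is a toy reading of the capability trap, not Repenning and Sterman's actual model: output is capability times working hours, capability erodes a little each week, and improving hours rebuild it. All parameters are assumptions picked so the baseline is steady.

```python
def simulate_output(improve_hours, weeks=26, total_hours=100,
                    capability=1.0, decay=0.05, gain=0.005):
    """Weekly output when `improve_hours` per week go to improvement.

    Assumed dynamics (a toy sketch, not the published model):
      - output = capability * hours spent working;
      - capability erodes by `decay` per week and grows by `gain`
        for every hour spent improving.
    """
    work_hours = total_hours - improve_hours
    outputs = []
    for _ in range(weeks):
        outputs.append(capability * work_hours)
        capability = capability * (1 - decay) + gain * improve_hours
    return outputs


# 10 improving hours a week holds capability steady under these parameters.
baseline = simulate_output(improve_hours=10)[0]
invested = simulate_output(improve_hours=30)

# First week (0-based) in which the investing team matches the old baseline.
breakeven_week = next(w for w, out in enumerate(invested) if out >= baseline)
# Output forgone while below baseline: the slack the investment consumes.
slack_needed = sum(baseline - out for out in invested[:breakeven_week])

print(f"baseline output: {baseline:.1f}")
print(f"first week after investing: {invested[0]:.1f}")
print(f"weeks below baseline: {breakeven_week}")
print(f"output forgone during the dip: {slack_needed:.1f}")
```

Under these assumptions the investing team spends four weeks below its old baseline before pulling ahead for good. The forgone output during those weeks is the price of admission, and a team with no slack has nothing to pay it with.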
Two more findings complete the picture. First, the incentive problem in Repenning and Sterman's title: organizations promote the heroes of visible saves, not the engineers whose prevention meant nothing visibly happened. The on-call firefighter gets the spot bonus; the person who quietly deleted an entire class of outage gets nothing measurable. That reward gradient is the spiral's fuel line: it pulls talent and effort toward the symptomatic channel forever. Second, and most clarifying: studies of corporate turnarounds (Schweizer and Nienhaus's 2017 review among them) find that failing firms don't restructure less than survivors; they often restructure more intensively, with similar strategies. The difference that predicts survival isn't effort or even strategy choice; it's execution and sequence, with speed of stabilization the dominant variable: companies that stabilize cash within about ninety days survive at markedly higher rates. The doomed team is rarely the lazy one. It's the one doing the right fix in the wrong order, harder and harder, while the loop compounds underneath it. Delay is multiplicative. The spiral doesn't pause while you do the right thing slowly.
The general shape, in both vocabularies:
Declare it. The turnaround begins when someone is named to the crisis and denial officially ends; the incident begins when someone declares the SEV. The declaration matters because it legitimizes everything that follows, especially the unpopular freezes.
Freeze selectively. The corporate version freezes discretionary spend while protecting payroll and the critical vendors: stop the bleed, keep the heart beating. The engineering version is the error-budget freeze: feature launches stop; reliability work, and only reliability work, proceeds. A total halt is not stabilization; it's a different way to die.
Make the runway measurable and dated. The 13-week model; or, in engineering terms, the toil audit and the error budget: how many engineer-hours a week does the pager actually own? What fraction of the team's capacity is already spoken for before any project starts? Vague dread becomes a number with a date, and the number tells you whether the rewrite you're contemplating is funded in the only currency that matters.
Buy slack crudely. This is the step engineering pride resists. Emergency liquidity in a turnaround is expensive and undignified: DIP financing, asset sales, the Microsoft check while the crowd boos. The engineering equivalents are equally inelegant and equally correct: roll back to the last stable version and stay there a while; over-provision the infrastructure even though it's wasteful, because hardware is cheaper than the spiral; put a human in the loop as a manual stopgap; turn the broken feature off. None of this fixes anything. That is the point. Stabilization is not the fix and must never masquerade as it: it buys the time in which the fix becomes possible. General Motors' 2009 bankruptcy is the canonical proof: the Chapter 11 automatic stay froze every creditor claim (a feature freeze with the force of law), debtor-in-possession financing bought the runway, and only then came the fundamental surgery: four brands killed, contracts renegotiated. GM emerged in forty days, fast because stabilization preceded restructuring. And Spirit Airlines is the proof of the converse: it emerged from Chapter 11 and was back in bankruptcy within five months. That was a stabilization that didn't actually create slack, so the spiral simply resumed where it left off. A cosmetic freeze defers the loop; it doesn't interrupt it.
Protect the capacity the fix will need. This is the 50% cap, and it's the step that separates buying time from wasting it. The slack you just manufactured will be consumed by something; the only question is whether you ring-fence it for the fundamental fix or let the backlog eat it. Then, and only then, comes the rewrite, done knowing worse-before-better still applies, but now from a base that can survive the dip.
Here's the practical version, small enough to actually do. Take your team's most painful system and write its 13-week model. Not in dollars, in whatever your spiral consumes. How many engineer-hours a week does the pager own? How fast is that number growing? At the current slope, what week does discretionary engineering capacity hit zero? That number exists whether or not you compute it; the only choice is whether you meet it on paper or in person.
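One way to run that exercise, sketched under a loud assumption: that pager load keeps growing along the straight line fitted to its recent history. Real toil may accelerate, so treat the answer as optimistic. The weekly figures and the helper names `fit_slope` and `weeks_until_zero_slack` are hypothetical.

```python
import math


def fit_slope(values):
    """Ordinary least-squares slope and intercept for evenly spaced weekly values."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two weeks of history")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def weeks_until_zero_slack(pager_hours, capacity_hours):
    """Whole weeks from the latest observation until the trend line reaches capacity.

    Returns None when the fitted trend is flat or improving.
    """
    slope, intercept = fit_slope(pager_hours)
    if slope <= 0:
        return None
    crossing = (capacity_hours - intercept) / slope  # week index where trend = capacity
    return max(0, math.ceil(crossing - (len(pager_hours) - 1)))


# Hypothetical team: 200 engineer-hours a week, pager load from the last six weeks.
history = [120, 126, 132, 138, 144, 150]
capacity = 200
weeks_left = weeks_until_zero_slack(history, capacity)
print(f"Discretionary capacity hits zero in about {weeks_left} weeks")
```

For this invented history the pager gains six hours a week and the last discretionary hour disappears in roughly nine weeks. Any rewrite planned to take longer than that is funded out of a deficit.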
Then look at whatever big fix your team is currently attempting (the migration, the rewrite, the re-architecture) and ask the sequencing question: are we doing this from a stabilized base, or instead of stabilizing? If the fix's hours are coming out of a deficit, you already know how it ends, because the dynamics have been published for twenty-five years: the dip arrives, the starved team retreats, the spiral resumes with interest. Stop. Declare it. Freeze what can be frozen, buy slack as crudely as necessary, ring-fence the engineering half, and then rebuild.
The crowd may boo. Take the check anyway. The iMac comes later, and it only comes at all because somebody refused to attempt it while the building was on fire.
You can't write the 13-week model of a spiral you can't measure.
The whole discipline turns on one number: how much of your capacity the fire already owns. For a fleet of agents that number is just as real and far harder to see: the toil is spread across dozens of autonomous runs, and an agent's own after-the-fact summary won't tell you honestly where its hours actually went. Chain of Consciousness anchors every agent action to a verifiable external record, so “what is the pager actually consuming?” becomes a number you can compute from what happened, not a guess. That is the first move stabilization requires.
pip install chain-of-consciousness · npm install chain-of-consciousness