Combine your leading indicators into one number. Then weight the one that actually predicts.
In 1968, a 26-year-old assistant professor of finance at NYU named Edward Altman published a formula that should not have worked as well as it did, and that is still in active use around the world fifty-eight years later. The field he was working in had a problem it didn't quite recognize as a problem: to judge whether a company was headed for bankruptcy, analysts stared at a handful of financial ratios (the current ratio, the debt ratio, profitability, asset turnover) one at a time, and argued endlessly about which one mattered most. Everyone had a favorite. Everyone weighted them by gut.
Altman did something different. He took 66 publicly traded manufacturers, 33 that had filed for bankruptcy and 33 matched companies that hadn't, and fed them into a statistical technique called multiple discriminant analysis, which finds the combination of variables that best separates two groups. He let the data decide how much each ratio mattered. Out came a single number, the Z-Score:
Z = 1.2(working capital/assets) + 1.4(retained earnings/assets) + 3.3(EBIT/assets) + 0.6(equity/liabilities) + 1.0(sales/assets)
with three zones: above 2.99, the company is safe; below 1.81, it is in distress and likely heading for bankruptcy; in between is a grey zone. And it worked, startlingly: in Altman's original sample it correctly flagged failing companies 95% of the time one year before they went under, 72% two years out, and 48% three years out, degrading gracefully the further into the future it reached, the way an honest leading indicator should. Fifty-eight years later, with re-fitted variants for private firms and non-manufacturers, it is still taught, still used by lenders and auditors and investors, still predicting corporate death.
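As a minimal sketch, the formula and the three zones translate directly into a few lines of Python. The `Financials` container, the function names, and the example figures are illustrative assumptions, not anything from Altman's paper; only the coefficients and cut-offs come from the text above.

```python
from dataclasses import dataclass


@dataclass
class Financials:
    """Hypothetical container for the inputs to the five ratios."""
    working_capital: float
    retained_earnings: float
    ebit: float
    equity: float
    total_liabilities: float
    sales: float
    total_assets: float


def z_score(f: Financials) -> float:
    """Weighted sum of the five ratios, using the 1968 coefficients."""
    a = f.total_assets
    return (
        1.2 * f.working_capital / a
        + 1.4 * f.retained_earnings / a
        + 3.3 * f.ebit / a
        + 0.6 * f.equity / f.total_liabilities
        + 1.0 * f.sales / a
    )


def zone(z: float) -> str:
    """Map a Z-Score onto the three zones described above."""
    if z > 2.99:
        return "safe"
    if z < 1.81:
        return "distress"
    return "grey"


# Made-up figures, for illustration only.
example = Financials(
    working_capital=20, retained_earnings=30, ebit=15,
    equity=80, total_liabilities=60, sales=120, total_assets=100,
)
z = z_score(example)
print(round(z, 3), zone(z))  # 3.155 safe
```

Five ratios go in and one number on a line comes out. That is the whole "combine" step.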
The Z-Score's design encodes two disciplines that engineering health-monitoring almost universally lacks. The first is obvious once you see it. The second is the one that actually matters, and the one we get wrong.
The first thing Altman did was combine. Distress, he understood, is a multi-factor condition. No single ratio captures it: a company can be liquid but unprofitable, profitable but drowning in debt, growing fast but burning cash. Watching five ratios separately, a human analyst is doing pattern-recognition across five graphs in their head, and humans are bad at that, especially under stress, especially when the graphs disagree. The composite does the integration for you: it collapses the five-dimensional question "is this company sick?" into a single number on a line with a threshold. The at-risk company becomes obvious instead of arguable.
This "combine into one number" move is one of the most reliable wins in the history of applied measurement, and it shows up wherever the stakes are high and the signals are many. In 1952, sixteen years before Altman, the anesthesiologist Virginia Apgar faced newborns being assessed by clinical gestalt, with at-risk babies missed in the chaos of the delivery room. She combined five signs (heart rate, breathing, muscle tone, reflex, color), scored each 0–2, and summed them into a single 0–10 Apgar score. It was a number any nurse could compute in a minute, and it made the baby needing immediate help unmissable. It is widely credited with saving lives precisely by replacing "the experienced clinician's overall impression" with one legible, comparable number. The same logic runs the FICO score (1989): a weighted composite of your credit factors into a single number that decides whether you get the loan. Medicine, consumer lending, corporate finance: all independently discovered that a calibrated composite beats a wall of separate gauges plus expert gut.
Engineering, by and large, has not. We build the pre-Altman thing: a dashboard. Twelve gauges (CPU, memory, error rate, p99 latency, deploy-failure rate, queue depth, churn, saturation, the rest), each on its own graph, each watched separately, no composite at all. And we are proud of it, because it looks like rigor. It is, in fact, the exact situation Altman's field was in before 1968: a pile of individual ratios and a human trying to integrate them by eye. A dashboard of twelve gauges is a Rorschach test. You see in it whatever you're primed to see, and what you're primed to see, almost always, is whichever graph is reddest right now. That is recency-and-salience bias wearing the costume of monitoring: you react to the loud signal, not the predictive one, and the quiet ratio that actually leads the failure, the EBIT/assets of your system, sits in the corner, unweighted and unwatched, because it isn't red yet.
Here is the part that separates a real prediction from a comforting one, and it is the part almost everyone skips.
Altman's genius was not picking those five ratios. Other researchers had used the same ratios; the ratios were the easy part. His genius was that he did not choose the weights. He didn't sit down and decide that profitability should count twice as much as liquidity because that felt right. He ran discriminant analysis against real bankruptcies and real survivals, and the data assigned the weights (1.2, 1.4, 3.3, 0.6, 1.0), each number a measurement of how powerfully that ratio actually separated the dead companies from the living ones. The weights aren't an opinion. They're a finding.
And look at the finding. The heaviest weight, by a wide margin, is 3.3 on EBIT/assets: operating earning power, the earnings core. The data said, across 66 real companies, that a company's ability to generate operating earnings from its assets is the single thing whose failure most reliably precedes its death. Not its debt load, not its liquidity: its earnings engine. That's not something Altman knew going in and encoded; it's something the discriminant analysis taught him, and the weight is the lesson written down. The weights are simultaneously the prediction and a diagnosis of what to watch.
This is the discipline engineering health scores violate constantly. When teams do build a composite "health score," the weights are almost always guessed. Someone in a meeting decides errors are 50%, latency 30%, CPU 20%, and everyone nods, and the number ships. But a guessed-weight composite is not a prediction: it's a Rorschach test with extra steps, a way of laundering the same gut-feeling into a single number that now looks objective. If your weights came from a meeting instead of from your data, your "health score" tells you exactly as much as the meeting did, which is to say, what people already believed.
So build a real one. The translation is direct, and the method is Altman's, almost line for line.
First, choose your candidate leading indicators: the ratios of your domain. For a service: error-budget burn rate, deploy-failure rate, p99 latency trend, saturation, change-fail rate, rollback frequency, dependency-error rate. For a team's health (because teams fail too, and the signals lead): on-call load, time-in-grey on the SLO, attrition signals, review-latency creep, the talent-flight tells. Pick the ones that plausibly relate to failure, the way Altman picked ratios that plausibly relate to distress. This is the input, and garbage in is still garbage out: the composite is only as good as the indicators you feed it.
Second, and this is the whole ballgame, derive the weights from your real incident history. You have it: a record of past outages, SLO breaches, the times things went badly. Treat each as Altman treated a bankruptcy. Run a regression, even a plain logistic regression, of "did this state precede an incident?" against your candidate indicators, over your own history. Let your data assign the weights. The output is a composite tuned to your system's actual failure modes, and the weights themselves are a revelation: you will learn which indicator most reliably leads your catastrophes. Maybe it's deploy-failure rate. Maybe it's error-budget burn. Maybe, for the team, it's the attrition signal that quietly precedes the cascade of outages a depleted team can't prevent. Whatever it is, that's your EBIT/assets, the 3.3 factor, and you should weight it highest and watch it hardest, because the data, not the meeting, said it leads.
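Here is a minimal sketch of that step, using only the Python standard library. Everything in it is an assumption made for illustration: the indicator names, the synthetic "incident history" (generated so that rollback frequency is the true leading signal), and the hand-rolled gradient-descent fit. In practice you would load your own history and would probably reach for a statistics library. The indicators are standardized first so that the fitted weights can be compared with one another.

```python
import math
import random


def sigmoid(t: float) -> float:
    """Numerically safe logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def standardize(rows):
    """Scale each column to mean 0, std 1 so fitted weights are comparable."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]


def fit_logistic(X, y, lr=0.5, steps=500):
    """Plain batch gradient descent on the logistic log-loss."""
    n, k = len(X), len(X[0])
    w, b = [0.0] * k, 0.0
    for _ in range(steps):
        gw, gb = [0.0] * k, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            gb += err
            for j in range(k):
                gw[j] += err * xi[j]
        b -= lr * gb / n
        w = [wj - lr * g / n for wj, g in zip(w, gw)]
    return w, b


# Synthetic history, NOT real data: one row per period, one label per row
# saying whether an incident followed. Rollback frequency is wired in as the
# true leading signal; error rate is pure noise.
INDICATORS = ["rollback_freq", "p99_latency", "error_rate"]
random.seed(7)
raw, incident = [], []
for _ in range(300):
    x = [random.gauss(0, 1) for _ in INDICATORS]
    logit = 2.5 * x[0] + 0.3 * x[1] - 1.0
    raw.append(x)
    incident.append(1 if random.random() < sigmoid(logit) else 0)

weights, bias = fit_logistic(standardize(raw), incident)
ranked = sorted(zip(INDICATORS, weights), key=lambda p: abs(p[1]), reverse=True)
for name, w in ranked:
    print(f"{name:15s} {w:+.2f}")
```

The top of the ranked list is your candidate for the 3.3 factor. The composite itself is `sigmoid(bias + sum(w * x))` over a period's standardized indicators.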
The payoff is often a genuine surprise, and the surprise is the point. I'd bet money that most teams would guess, in the meeting, that p99 latency and error rate are the big leading indicators, because those are the graphs that scream during an outage. But latency and error rate are frequently lagging: they spike when the failure is already happening. Run the regression honestly against the history and you often find the heavy weight lands somewhere quieter and earlier: rollback frequency creeping up over a fortnight, the error budget burning a little faster each week, change-fail rate drifting, deploy sizes growing. Those are the boring graphs nobody watches because they're never the reddest, and they're exactly the ones that lead, the way EBIT/assets quietly led bankruptcy while everyone stared at the debt. The regression doesn't just give you a number; it hands you a map of your system's failure physics, and that map almost never matches the meeting's intuition. That mismatch is the whole value: it's the difference between watching what's alarming and watching what's predictive.
Even a crude data-derived weighting beats a careful guess, for the same reason Altman's simple linear model beat decades of expert ratio-arguing: the model is calibrated against reality and the experts were calibrated against each other. A dashboard of twelve gauges is a Rorschach test; a regression against your incident history is a prediction. The difference between them is not sophistication: it's whether the weights came from data or from which graph is reddest today.
Three caveats, because a tool used carelessly is worse than no tool.
Re-fit the weights as the system changes. Altman's 1968 coefficients were calibrated on 1960s manufacturers; applied blindly to a modern software company or an emerging-market firm they mislead, which is exactly why Altman himself published re-fitted variants (the Z'-Score for private firms, the Z''-Score for non-manufacturers). Your system evolves; a weight that predicted last year's failures may not predict next year's. Re-regress periodically, or you are confidently forecasting a system that no longer exists.
Keep the individual gauges, for diagnosis, not prediction. The composite is the alarm: it tells you the system is sick, sooner and more reliably than any single graph. But it cannot tell you why: a Z-Score of 1.4 says "distress," not "your debt is the problem." When the composite fires, you go back to the underlying indicators to diagnose the cause. The composite predicts; the dashboard explains. You need both, in that order.
The composite is only as good as its inputs and its calibration data. If your incident history is thin or your candidate indicators don't actually relate to failure, the regression will happily fit noise. Altman had the advantage of a clean binary outcome (bankrupt or not) and a balanced sample; engineering failures are messier. Treat the early weights as a hypothesis to be tested against future incidents, not as gospel: the model earns trust by predicting failures it wasn't trained on, the same standard Altman's model faced.
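One hedged way to apply that standard is to fit the weights on the earlier part of your history and score the later part, which the fit never saw. The sketch below shows only the two pieces that doing so needs: a chronological split, and a rank-based AUC as one possible measure of how well the composite separates incident periods from quiet ones. The function names and the toy scores are illustrative assumptions.

```python
def time_split(records, train_fraction=0.7):
    """Split chronologically ordered records: fit on the past, test on the future."""
    cut = int(len(records) * train_fraction)
    return records[:cut], records[cut:]


def auc(scores, labels):
    """Chance that a random incident period outscores a random quiet one.

    Ties count as half a win. 0.5 means no better than a coin flip;
    1.0 means every incident period scored above every quiet period.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need both incident and non-incident periods")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy held-out scores and outcomes, made up for illustration.
held_out_scores = [0.9, 0.2, 0.3, 0.4, 0.1]
held_out_incident = [1, 0, 1, 0, 0]
print(round(auc(held_out_scores, held_out_incident), 3))  # 0.833
```

If the held-out figure sits near 0.5, the weights were fit to noise, whatever they looked like on the training history.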
Look at your monitoring. If it is a wall of disconnected gauges that a human integrates by eye, and if the integration in practice amounts to "react to whatever's reddest", you are running the pre-1968 playbook. Do the two things Altman did. Combine: define a single composite distress index for your system or team, so the at-risk state becomes one legible number instead of a twelve-graph judgment call. And calibrate: do not guess the weights. Regress your own incident history on the candidate indicators, let the data tell you which indicator most reliably leads your failures, and weight that one highest. Re-fit it as you change. Keep the gauges for the autopsy.
Edward Altman's real gift, fifty-eight years on, was never the five ratios. It was the insistence that the weights are an empirical question with a data-shaped answer: that the difference between a number that describes your present and a number that predicts your future is whether you weighted the factors by what actually preceded death, or by which one felt important in the meeting. Combine your signals into one number. Then weight the one that actually predicts. The data knows which one it is; you only have to ask it.
Whether to trust an agent is a multi-factor condition. Combine the signals; don't watch them separately.
Deciding whether to trust an agent is Altman's problem exactly: no single gauge captures it. Its provenance, its track record, its ratings: each is one ratio, and watching them on separate dashboards means reacting to whichever is loudest, not to the one that predicts. The Agent Trust Stack is the composite: it combines the provenance record (Chain of Consciousness) and the portable reputation (the Agent Rating Protocol) into one trust assessment, so the at-risk agent is a legible signal instead of a twelve-graph judgment call. Combine the signals, then weight the one that actually predicts.
Verify an agent's trust signals
pip install agent-trust-stack · npm install chain-of-consciousness agent-rating-protocol