
Cohort Analysis: Bigger Is Not Better

The aggregate metric hides whether you're improving or just outrunning churn. A rising number can sit on top of a business where every cohort is dying.

Published June 2026 · 9 min read

Picture a board meeting. The founder puts up the revenue chart and it is beautiful: up and to the right, 40% growth year over year, the kind of line that gets a company its next round. Everyone is pleased. Then the new investor running diligence asks for one specific cut of the data that isn't on the slide: revenue retention, grouped by the month each customer signed up. The analyst pulls it together, and the picture inverts. Lined up by vintage, every single monthly cohort has lost more than half its revenue by the time it's a year old, and the curves are getting steeper with each passing quarter. The newest cohorts are leaking fastest of all.

The 40% growth was real. It just wasn't what everyone in the room thought it was. It wasn't the business getting better; it was new customers pouring into the top of the bucket fast enough to hide the fact that the bucket was leaking worse every quarter. The aggregate revenue line said winning. The cohort curves said dying. And here is the unsettling part: both were completely true at the same time. That gap, between a number that's going up and a business that's getting worse, is one of the most expensive blind spots in technology, and the instrument that closes it is called cohort analysis.

Why the aggregate is built to lie

The mechanism is almost embarrassingly simple once you see it, which is exactly why it fools so many smart people. Any aggregate metric (total revenue, overall churn, monthly active users, fleet-wide uptime) is a single number computed across your entire current population. But that population is a blend of vintages: brand-new arrivals mixed with grizzled veterans mixed with everyone in between. When you're growing, the mix is dominated by the newcomers, who haven't had time to leave yet. So the aggregate is really a moving average over a constantly changing crowd, and it tracks the thing you're adding far more than the thing you're keeping.

This is why a company with genuinely terrible retention can post a gorgeous, rising aggregate revenue curve indefinitely, as long as it acquires new customers faster than the old ones bleed out. Stack enough fresh, healthy cohorts on top and they bury the cratering ones underneath. The veterans are quietly leaving in droves; you can't see it, because this quarter's flood of newcomers more than fills the hole, for now. It's the leaky bucket: pour water in fast enough and the bucket looks full even as the leak widens. The trouble is that the leak compounds and the acquisition gets more expensive, so the day the inflow slows even slightly, the whole thing caves in, and it caves in fast, because the rot was there the entire time. You just couldn't see it in the one number you were watching.
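The leaky bucket is easy to reproduce with a toy model. The sketch below uses made-up numbers (a first cohort worth 100, new-cohort size growing 10% a month, each cohort keeping 93% of its revenue from one month to the next) and holds the leak constant to keep the arithmetic simple. None of these figures come from a real company; the names and parameters are illustrative assumptions.

```python
# Toy leaky bucket. Every number here is hypothetical.

def cohort_revenue(start, retention, age):
    """Revenue of one cohort `age` months after signup."""
    return start * retention ** age


def total_revenue(month, first_start=100.0, growth=1.10, retention=0.93):
    """Aggregate revenue in `month`: the sum over every cohort signed up so far."""
    return sum(
        cohort_revenue(first_start * growth ** signup, retention, month - signup)
        for signup in range(month + 1)
    )


totals = [total_revenue(m) for m in range(25)]
aggregate_always_rises = all(b > a for a, b in zip(totals, totals[1:]))

# Share of its starting revenue a single cohort still has after 12 months.
year_one_retention = cohort_revenue(1.0, 0.93, 12)

print("aggregate rises every month:", aggregate_always_rises)
print(f"a cohort keeps {year_one_retention:.0%} of its revenue after a year")
```

Under these assumptions the total climbs every month while each cohort, followed on its own clock, has lost more than half its revenue by its first birthday. Lower `growth` toward 1.0 and the total flattens quickly, because the leak was there all along.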

The deepest version of this isn't a quirk; it's a named statistical trap, and it can do something worse than hide the truth. It can reverse it.

Simpson's paradox, and the aggregate that accused the wrong party

In 1973, the University of California, Berkeley looked at its graduate admissions and saw what appeared to be flagrant gender bias. In aggregate, about 44% of the 8,442 men who applied were admitted, versus only 35% of the 4,321 women, a gap far too large to be chance, and exactly the kind of number that ends up in a lawsuit. On the face of it, the university was discriminating against women.

Then three researchers, Peter Bickel, Eugene Hammel, and J. William O'Connell, did the thing the aggregate doesn't do. They split the data by department, and the bias didn't just shrink. It flipped. Department by department, the apparent bias against women largely disappeared; in four of the six largest departments, a woman was actually more likely to be admitted than a man. Their findings, published in Science in 1975, became the textbook case of what statisticians call Simpson's paradox: a trend that holds in the aggregate can vanish or reverse inside the subgroups, because a lurking variable is confounding the whole picture. At Berkeley, that variable was which department people applied to. Women disproportionately applied to competitive departments with low acceptance rates for everyone; men applied more to departments that admitted most applicants. The aggregate wasn't measuring bias. It was measuring application patterns, and it had pinned the blame on exactly the wrong thing.
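The reversal is easier to believe once you run the arithmetic yourself. The two-department table below is invented for illustration; it is not the Berkeley data, and the department names and counts are assumptions chosen only to make the effect visible.

```python
# Hypothetical admissions table: department -> group -> (applied, admitted).
# These are NOT the Berkeley figures.
applications = {
    "easy": {"men": (800, 480), "women": (200, 130)},
    "hard": {"men": (200, 40), "women": (800, 200)},
}


def department_rate(department, group):
    """Admission rate for one group inside one department."""
    applied, admitted = applications[department][group]
    return admitted / applied


def aggregate_rate(group):
    """Admission rate for one group pooled across all departments."""
    applied = sum(dept[group][0] for dept in applications.values())
    admitted = sum(dept[group][1] for dept in applications.values())
    return admitted / applied


for department in applications:
    print(department,
          f"men {department_rate(department, 'men'):.0%}",
          f"women {department_rate(department, 'women'):.0%}")

print("pooled",
      f"men {aggregate_rate('men'):.0%}",
      f"women {aggregate_rate('women'):.0%}")
```

In this made-up table women are admitted at a higher rate in both departments, yet the pooled rate favors men, purely because most women applied to the department that rejects most applicants.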

Sit with how dangerous that is. The aggregate didn't merely fail to show the truth; it confidently asserted the opposite of the truth, with a big, statistically significant number behind it. And the only thing standing between Berkeley and a completely wrong conclusion was someone insisting on disaggregating by the confounding variable.

Cohort analysis is precisely this move, applied to time. The lurking variable in your rising dashboard is vintage (when each customer, service, or hire entered), and it plays the exact role Berkeley's “department” played. Slice the metric by cohort and you remove the confound: you stop comparing a population that's mostly newcomers this year to one that was mostly veterans last year, and you start comparing each group to its own past, on its own clock. The aggregate does nothing to remove the vintage confound. The cohort view removes it entirely.

An old idea from a field that counts the dead

If this feels like a clever SaaS-dashboard trick, the discipline is in fact far older, and comes from a field with no revenue at all: demography, the study of populations. Demographers have always had to distinguish two fundamentally different ways of reading a number across time. A period measure takes a snapshot of everyone alive right now (this year's death rates across all ages, say), which makes it a cross-section of wildly different vintages all at once. A cohort measure instead follows a single group who entered together (everyone born in 1950) through their whole lives, watching what actually happens to them. The sociologist Norman Ryder made this the centerpiece of a landmark 1965 paper, “The Cohort as a Concept in the Study of Social Change,” arguing that you cannot understand how a society is changing by staring at period snapshots, because each one blends generations whose experiences are nothing alike. Epidemiologists formalized the same insight into age-period-cohort analysis, which painstakingly separates three tangled time effects: how old you are, what's happening to everyone right now, and which cohort you belong to.

Even the word carries the lineage: a cohort was a unit of the Roman legion, a body of soldiers who enlisted and marched and fought as one. The defining feature was always that they moved through time together. That is the whole idea. Your January signups are a cohort. The microservices you launched in last spring's platform push are a cohort. The engineers from this year's new-grad class are a cohort. Each one entered together and is moving through its own life, and the only honest way to ask whether things are getting better is to watch a cohort age, not to read a period snapshot that smears every vintage into one lying average.

“Bigger” and “better” are different questions

The reason this matters so much is that the aggregate and the cohort answer two genuinely different questions, and companies constantly mistake one for the other. The aggregate answers “are we bigger?” The cohort answers “are we better?” You can answer an emphatic yes to the first while the true answer to the second is no, and never notice the difference until the growth stalls.

The clearest lens on this is what SaaS operators call net revenue retention: how much revenue a cohort generates over time, counting both the customers who leave and the ones who expand their spending. Gross retention can never exceed 100%; people only churn down. But net retention has no ceiling, because expansion can outrun churn. A cohort with net retention above 100% literally grows itself over its lifetime: even as some customers leave, the ones who stay spend enough more to push the whole cohort's revenue up year after year. That's the “smile curve”: retention dips as the casual signups fall away, then stabilizes around a loyal core and bends back upward. That business compounds; it needs less acquisition over time, not more, because its existing customers are doing the growing.
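The difference between the two retention figures is clearest in a tiny worked example. The sketch below follows one common convention, and it is an assumption rather than a universal standard: gross retention caps each customer at their starting revenue, so expansion never counts, while net retention compares the cohort's total revenue later with its total at the start. The customer names and amounts are hypothetical.

```python
# One hypothetical cohort: revenue per customer at signup and a year later.
start = {"acme": 100, "bolt": 100, "cask": 100, "dune": 100}
later = {"acme": 180, "bolt": 150, "cask": 90}  # "dune" churned entirely


def gross_retention(start, later):
    """Revenue kept, ignoring expansion: each customer is capped at their start."""
    kept = sum(min(amount, later.get(customer, 0)) for customer, amount in start.items())
    return kept / sum(start.values())


def net_retention(start, later):
    """Revenue kept, counting churn, contraction and expansion together."""
    now = sum(later.get(customer, 0) for customer in start)
    return now / sum(start.values())


gross = gross_retention(start, later)
net = net_retention(start, later)
print(f"gross retention: {gross:.1%}")
print(f"net retention:   {net:.1%}")
```

Here the cohort lost one customer outright and another shrank, so gross retention is well under 100%, but the two customers who expanded push net retention above it. That is a cohort growing itself.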

A cohort with net retention below 100% is the opposite: it shrinks every year, and the company has to acquire ever harder just to stay in place. Two companies can post headline growth in the same range (one at 40% with cohorts quietly losing 30% of their value, the other at 30% with cohorts compounding at 110% net retention) and they are not the same company at all. The first is on a treadmill that speeds up under it; the second is climbing a hill that gets easier. The aggregate growth rate cannot tell you which is which; here it even flatters the weaker business. A flat 5% monthly churn rate looks exactly the same in a thriving smile-curve business and a dying dead-end one. Only the cohort triangle, each vintage's curve laid out over its own life, reveals which one you actually have.
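Building the triangle takes only a few lines once each invoice carries the customer's signup month. This is a minimal sketch under stated assumptions: months are plain integer indexes, every cohort has some revenue in its signup month, and the invoice tuples are invented sample data. The function name `cohort_triangle` is hypothetical.

```python
from collections import defaultdict

# (customer, signup_month, billing_month, revenue). Hypothetical sample data.
invoices = [
    ("a", 0, 0, 100.0), ("b", 0, 0, 100.0),
    ("a", 0, 1, 100.0), ("b", 0, 1, 50.0),
    ("a", 0, 2, 100.0),                      # "b" has churned by month 2
    ("c", 1, 1, 100.0), ("d", 1, 1, 100.0),
    ("c", 1, 2, 80.0),                       # "d" churned after one month
]


def cohort_triangle(invoices):
    """Map each signup month to its revenue retention by age (age 0 = 1.0)."""
    revenue = defaultdict(lambda: defaultdict(float))
    for _customer, signup, month, amount in invoices:
        revenue[signup][month - signup] += amount

    triangle = {}
    for signup, by_age in sorted(revenue.items()):
        base = by_age[0]  # assumes the cohort billed something at age 0
        triangle[signup] = [by_age[age] / base for age in range(max(by_age) + 1)]
    return triangle


triangle = cohort_triangle(invoices)
for signup, curve in triangle.items():
    print(signup, [f"{share:.0%}" for share in curve])
```

Each row is one vintage read along its own clock, and the rows get shorter as the cohorts get younger, which is where the triangle shape comes from. In this sample the later cohort leaks faster at age 1 than the earlier one did, a fact no total-revenue line would surface.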

The same illusion runs your systems, not just your revenue

Here's where this stops being a finance lesson, because the confound isn't about money. It's about any metric computed across a population of mixed vintages, and engineering organizations are full of them.

Reliability. Your fleet-wide uptime looks healthy, holding its SLO quarter after quarter. But slice availability by service launch vintage and you may find every cohort of services slowly degrading as it ages (dependencies rotting, tech debt accreting, the original owners moving on) while the aggregate stays green only because each quarter's batch of pristine new services dilutes the decay underneath. You are not getting more reliable. You are adding healthy services faster than your old ones rot, which is the leaky bucket wearing an SRE dashboard.

Onboarding. “Team productivity” is up and the VP is taking credit for the improved onboarding program. Slice it by hire cohort, though, and ask how long each class of new engineers took to reach full ramp. If this year's cohort takes exactly as long as last year's, your onboarding didn't improve at all; you just hired more people, and the aggregate output rose on headcount. The thing you were measuring (“are we onboarding better?”) was answered by a number that only knew “are we bigger?”

Codebase health. The aggregate defect density or complexity score is stable, so the code is fine. Or: slice by module vintage, and watch the modules written three years ago steadily decay while this year's clean new modules hold the average up. The codebase isn't healthy; it's bimodal, and the mean is hiding a rotting old core behind a shiny new shell, until the day a critical change has to go through that old core.

In every one of these, the aggregate can rise while every cohort falls, and it is not a paradox; it's arithmetic. New healthy cohorts stacked on top of old cratering ones produce a rising average, every time. And in every one, you will believe you are improving while you are actually running the acquisition treadmill, right up until the inflow of new vintages can't cover the decay anymore.
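The reliability case can be sketched the same way, with an average instead of a sum. The model below is entirely hypothetical: the number of services launched doubles each quarter, every service loses a tenth of a point of availability per quarter of age, and the SLO is 99.8%. None of these values come from a real fleet.

```python
SLO = 99.8  # hypothetical availability target, in percent


def services_launched(quarter):
    """Number of services launched in a given quarter (doubling each quarter)."""
    return 10 * 2 ** quarter


def availability(age):
    """Availability (%) of any service `age` quarters after launch."""
    return 99.95 - 0.10 * age


def fleet_availability(quarter):
    """Service-weighted average availability across every launch cohort."""
    cohorts = range(quarter + 1)
    total = sum(services_launched(q) for q in cohorts)
    weighted = sum(services_launched(q) * availability(quarter - q) for q in cohorts)
    return weighted / total


fleet = [fleet_availability(t) for t in range(9)]
oldest_cohort_now = availability(8)

print("fleet meets SLO every quarter:", all(value >= SLO for value in fleet))
print(f"oldest cohort after 8 quarters: {oldest_cohort_now:.2f}%")
```

In this toy fleet the dashboard never goes red, because the newest services always outnumber the rest, while the oldest cohort has long since dropped below the target. Stop doubling the launches and the average starts to follow the old cohorts down.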

The discipline, and its honest limits

The practical rule is small enough to adopt this week: for anything that has a vintage (customers, deploys, services, features, hires, modules), don't trust the aggregate; slice the metric by cohort and watch each cohort's curve over its own life. And carry one diagnostic question into every dashboard review: is this number rising because each cohort is genuinely better, or because new cohorts are diluting the decay of the old ones? If you cannot answer that, you do not actually know whether you are improving. You only know you are bigger.
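One way to put the diagnostic question to a dashboard is to compare cohorts at the same age rather than in the same calendar month. The sketch below assumes you already have a triangle shaped as a mapping from cohort label to a list of retention values by age; the labels, numbers, and function names are hypothetical.

```python
# Hypothetical triangle: cohort label -> retention at age 0, 1, 2, ...
triangle = {
    "2025-01": [1.0, 0.80, 0.70],
    "2025-02": [1.0, 0.75, 0.62],
    "2025-03": [1.0, 0.70],
    "2025-04": [1.0],
}


def at_age(triangle, age):
    """Each cohort's value at a given age, skipping cohorts not yet that old."""
    return {
        cohort: curve[age]
        for cohort, curve in sorted(triangle.items())
        if len(curve) > age
    }


def improving(triangle, age):
    """True only if every later cohort beats the one before it at this age."""
    values = list(at_age(triangle, age).values())
    return len(values) >= 2 and all(b > a for a, b in zip(values, values[1:]))


print("retention at age 1:", at_age(triangle, 1))
print("each cohort better than the last at age 1?", improving(triangle, 1))
```

If the answer is no while the headline number is still rising, the rise is coming from volume, not from improvement. A strict "every cohort beats the last" test is a deliberately blunt choice here; a real review would probably allow for noise between adjacent cohorts.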

Two honest caveats keep this from becoming its own kind of overconfidence. The first is that the cohorts you most want to judge are the ones you can judge least. The newest cohort, the one that just experienced your big improvement, is also the youngest, with the least lifetime to observe, so you genuinely cannot yet know whether this quarter's change worked; you have to let the cohort age, or lean on early leading indicators and hold your conclusions loosely. Impatience here re-creates the very error cohorts exist to prevent. The second is that cohort analysis localizes a problem without diagnosing it. It will tell you with brutal clarity that your 2023-vintage services are decaying, but not why, or what to do. It de-confounds the vintage; the root cause is still your job, the way a map of where the deaths cluster still leaves you to find the poisoned well.

But that is a feature, not a flaw. The aggregate hands you a single comforting number and an unearned conclusion. The cohort hands you the truth, vintage by vintage, and an honest question. One of them lets you believe you're winning while every cohort decays beneath you. The other shows you the decay while there's still time to fix it. They are very different instruments, and as the founder in that board meeting learned the expensive way, they are very different companies, wearing the exact same dashboard.


Sources

  1. P. J. Bickel, E. A. Hammel & J. W. O'Connell, “Sex Bias in Graduate Admissions: Data from Berkeley,” Science (1975): the 1973 UC Berkeley data, about 44% of 8,442 men admitted vs 35% of 4,321 women in aggregate, with the bias reversing within departments (a woman more likely admitted than a man in four of the six largest). The canonical Simpson's-paradox case.
  2. Norman B. Ryder, “The Cohort as a Concept in the Study of Social Change,” American Sociological Review (1965). Period vs. cohort measures and age-period-cohort analysis are standard in demography and epidemiology.
  3. Net revenue retention, gross retention, and the cohort “smile curve” are standard SaaS metrics; gross retention is capped at 100%, net retention is not.
  4. “Cohort” derives from the Roman legion's unit of soldiers who served together.

You can only slice by vintage if you kept an honest record of each vintage.

Cohort analysis de-confounds time only when you can reconstruct what each cohort actually did over its whole life, deploy by deploy, action by action. The moment the population is autonomous agents, a fleet-wide success rate hides exactly the same thing a revenue line does: this quarter's fresh agents propping up the average while last year's vintage quietly degrades. You cannot see that in an aggregate, and you cannot trust a per-agent self-report to reconstruct it. Chain of Consciousness anchors every agent action to a verifiable external record, so the cohort triangle is built on what happened, not on what the dashboard chose to remember.


pip install chain-of-consciousness  ·  npm install chain-of-consciousness