The Diversity Prediction Theorem Is a Spec for Mixture-of-Experts

Scott Page wrote the spec in 2007. MoE engineers have been implementing it for thirty-five years without reading the documentation.

Published May 2026 · 9 min read

You wire in your new expert. It's a beautiful piece of work — fine-tuned on your hardest queries, evaluated on the toughest slice of your benchmark, and individually the most accurate expert in the pool. You add it to your eight-expert mixture-of-experts model, retrain the gating network, and run the full eval.

System loss goes up.

Every per-expert metric you check says the addition was right. Per-token accuracy improved. The new expert's specialist domain looks pristine. The gating network is even routing to it correctly. And yet the ensemble — the system you actually ship — is worse than before you added it.

There is a theorem that predicts exactly when this will happen. It was published in 2007 by a political scientist named Scott Page, in a book about why diverse committees make better decisions. Almost nobody in machine learning cites it.

That's a mistake worth fixing, because Page's theorem is not an analogy for mixture-of-experts. It is the specification.

The equation

The Diversity Prediction Theorem states:

Collective Error = Average Individual Error − Prediction Diversity

In Page's notation: E = M − D. The squared error of the group's aggregate prediction (the simple average of the individual predictions) equals the average squared error of the individual predictions minus the variance of those predictions around that average.

This is not a heuristic. It is not "in expectation" or "as the sample size grows." It is an algebraic identity — it holds, exactly, for any finite set of predictions and any true value, by basic variance bookkeeping. The same machinery gives you the bias-variance decomposition every undergraduate ML class teaches; Page's contribution was to point out the same identity, in the same form, for collective decision making.
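
The bookkeeping fits in four lines. The notation below is chosen here for illustration and is not necessarily Page's: s_i are the n individual predictions, θ is the true value, and c is their simple average.

```latex
\begin{align*}
c &= \frac{1}{n}\sum_{i=1}^{n} s_i \\
M = \frac{1}{n}\sum_{i=1}^{n}(s_i-\theta)^2
  &= \frac{1}{n}\sum_{i=1}^{n}\bigl((s_i-c)+(c-\theta)\bigr)^2 \\
  &= \frac{1}{n}\sum_{i=1}^{n}(s_i-c)^2
     + \frac{2(c-\theta)}{n}\sum_{i=1}^{n}(s_i-c)
     + (c-\theta)^2 \\
  &= \underbrace{\frac{1}{n}\sum_{i=1}^{n}(s_i-c)^2}_{D}
     + \underbrace{(c-\theta)^2}_{E}
\end{align*}
```

The cross term vanishes because the deviations from the average sum to zero, which leaves M = D + E, or E = M − D. The same steps go through if 1/n is replaced by any non-negative weights w_i that sum to one, with c defined as the weighted average; that is the form a gating network would need.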

Three knobs:

E is what you ship — the loss of the aggregated, gated, weighted output.
M is the average individual loss, which is what every per-expert benchmark you have measures.
D is the diversity of the individual outputs — how much the experts disagree with each other on the same inputs.

The two terms move E in opposite directions. You can lower E by making each expert better (drop M). You can also lower E by making your experts disagree more (raise D). And — this is the part that matters — they trade off against each other.
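
A minimal sketch of the two levers, assuming scalar predictions, squared error, and a simple average. The function name `decompose` and the numbers are made up for illustration.

```python
def decompose(predictions, truth):
    """Return (E, M, D) under squared error with a simple average."""
    n = len(predictions)
    aggregate = sum(predictions) / n
    collective_error = (aggregate - truth) ** 2                  # E
    mean_individual_error = sum(
        (p - truth) ** 2 for p in predictions
    ) / n                                                        # M
    diversity = sum((p - aggregate) ** 2 for p in predictions) / n  # D
    return collective_error, mean_individual_error, diversity


truth = 12.0
scenarios = {
    "baseline: two identical experts": [14.0, 14.0],
    "lever 1: better experts, same diversity": [13.0, 13.0],
    "lever 2: same accuracy, more diversity": [10.0, 14.0],
}

results = {name: decompose(preds, truth) for name, preds in scenarios.items()}
for name, (E, M, D) in results.items():
    print(f"{name:42s} E={E:.2f}  M={M:.2f}  D={D:.2f}")
```

In the third scenario each expert is exactly as wrong as in the baseline (M stays at 4), yet the collective error drops to zero because the two errors point in opposite directions.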

That trade-off is the entire reason mixture-of-experts works, the entire reason ensembles work, the entire reason a committee of three middling experts can outperform any single specialist, and the entire reason your perfectly-fine-tuned MoE can quietly degrade into a single-expert system without producing any error signal you can see.

What the theorem says about your MoE

The mapping is one-to-one:

Diversity Prediction Theorem	Mixture-of-Experts
Individual predictor	Expert network
Average individual error (M)	Average per-expert loss
Prediction diversity (D)	Variance of expert outputs on the same input
Aggregate prediction	Gating-network-weighted combination
Collective error (E)	MoE system loss
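
The mapping in the table can be sketched in code. This is an illustration under stated assumptions, not a production metric: expert outputs are scalars, the loss is squared error, and the gate weights for each input are non-negative and sum to one. The function name `moe_decomposition` and the toy batch are hypothetical.

```python
def moe_decomposition(expert_outputs, gate_weights, targets):
    """Batch-mean (E, M, D) for a gated mixture under squared error.

    expert_outputs[b][i]: scalar output of expert i on input b
    gate_weights[b][i]:   gate weight of expert i on input b (sums to 1 per input)
    targets[b]:           true value for input b
    """
    e_sum = m_sum = d_sum = 0.0
    for outs, gates, y in zip(expert_outputs, gate_weights, targets):
        mixed = sum(g * o for g, o in zip(gates, outs))      # aggregate prediction
        e_sum += (mixed - y) ** 2                            # collective error
        m_sum += sum(g * (o - y) ** 2 for g, o in zip(gates, outs))      # weighted individual error
        d_sum += sum(g * (o - mixed) ** 2 for g, o in zip(gates, outs))  # weighted diversity
    n = len(targets)
    return e_sum / n, m_sum / n, d_sum / n


# Toy batch: two inputs, three experts. The first input routes to two experts only.
expert_outputs = [[1.0, 3.0, 2.0], [0.0, 4.0, 5.0]]
gate_weights = [[0.5, 0.5, 0.0], [0.25, 0.25, 0.5]]
targets = [2.0, 3.0]

E, M, D = moe_decomposition(expert_outputs, gate_weights, targets)
print(f"E={E:.4f}  M={M:.4f}  D={D:.4f}  M-D={M - D:.4f}")
```

Two things to note about this form. M and D here are gate-weighted, so an expert with zero weight on an input contributes to neither. And with top-1 routing the weighted D is zero on every input by construction, so the identity only has something to say when more than one expert receives weight.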

The mixture-of-experts architecture in the Jacobs, Jordan, Nowlan, and Hinton paper from 1991 (Adaptive Mixtures of Local Experts, Neural Computation 3:79–87) was designed around the implicit insight that competing experts produce uncorrelated errors and thus low collective loss. Page's theorem, sixteen years later, made that explicit and quantitative: the gain you get from combining experts is exactly the D term in the formula, and you can measure it.

Recent MoE research has rediscovered the same fact without citing Page. "Combining diverse experts is essential for achieving significant accuracy improvements, as the less correlated the errors among experts, the stronger the ensemble," a 2023 paper on diversifying MoE representations stated (arXiv 2310.09762). A 2024 survey on MoE in LLMs (A Closer Look into Mixture-of-Experts in Large Language Models, arXiv 2406.18219, accepted at NAACL 2025) measured the parametric and behavioral correlation between experts in Mixtral 8x7B, DeepSeekMoE, and Grok-1 — and found Mixtral's experts more correlated than the others, with the explicit implication that Mixtral's collective gain is smaller than its individual experts' quality would predict.

If you read those papers without Page's theorem in your hand, the conclusions feel like a collection of observations. With Page's theorem, they are one observation: the D term in Mixtral is lower than in DeepSeekMoE, so Mixtral's E sits closer to its M than DeepSeekMoE's E does. Same formula, different system, different number. The theorem is the spec; the papers are the measurement.

Why your new expert hurt the system

Back to the new specialist you wired in.

You added a highly accurate expert. Per-token accuracy improved, so the new expert pulls M in the right direction: the average individual error dropped slightly.

But the new expert was fine-tuned on the same dataset as the existing pool. Its predictions correlate with the existing experts'. When the gating network routes a hard query to your new specialist, the answer it produces resembles the answer the other experts would have produced. The variance of expert outputs on the same input — D — dropped substantially.

In the equation E = M − D, both terms fell. The drop in M looks like progress on every per-expert dashboard; the drop in D appears on none of them. E is the difference between the two, so if D falls by more than M does, E goes up. Which is exactly what happened on your dashboard.
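
A toy single-input illustration with made-up numbers (none of them come from a real system): the true value is 10, four existing experts predict 6, 14, 5, and 15, the new expert predicts 12, and everything is combined by a simple average.

```latex
\begin{align*}
\text{Before:}\quad c &= \tfrac{6+14+5+15}{4} = 10, \\
M &= \tfrac{16+16+25+25}{4} = 20.5, \qquad
D = \tfrac{16+16+25+25}{4} = 20.5, \qquad
E = M - D = 0 \\[4pt]
\text{After:}\quad c &= \tfrac{6+14+5+15+12}{5} = 10.4, \\
M &= \tfrac{16+16+25+25+4}{5} = 17.2, \qquad
D = \tfrac{19.36+12.96+29.16+21.16+2.56}{5} = 17.04, \qquad
E = M - D = 0.16
\end{align*}
```

The new expert has the smallest individual error in the pool (squared error 4, against 16 and 25 for the others). M falls by 3.3, D falls by 3.46, and E rises from 0 to 0.16.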

This is the design rule Page's theorem hands you, free, in two sentences:

A more-accurate but correlated expert can raise collective error. A less-accurate but diverse expert can lower it.

The composition question for any new expert is not "is it more accurate?" It is "does it make different mistakes from the existing pool?" The theorem gives you a number. You can measure it. You can optimize against it. You can build CI checks on it. You can reject ensemble additions on it the way you reject test regressions.
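
One way such a check could look, as a sketch rather than a finished CI gate. It assumes scalar predictions on a held-out set, squared error, and equal-weight averaging instead of a retrained gating network (a deliberate simplification). The names `mean_decomposition`, `review_candidate`, and `tolerance` are hypothetical.

```python
def mean_decomposition(pool_preds, targets):
    """Held-out mean (E, M, D) for an equal-weight pool.

    pool_preds[i][b] is expert i's prediction on held-out input b.
    """
    n_experts = len(pool_preds)
    e_sum = m_sum = d_sum = 0.0
    for b, y in enumerate(targets):
        outs = [preds[b] for preds in pool_preds]
        agg = sum(outs) / n_experts
        e_sum += (agg - y) ** 2
        m_sum += sum((o - y) ** 2 for o in outs) / n_experts
        d_sum += sum((o - agg) ** 2 for o in outs) / n_experts
    n = len(targets)
    return e_sum / n, m_sum / n, d_sum / n


def review_candidate(pool_preds, candidate_preds, targets, tolerance=0.0):
    """Accept the candidate only if collective error does not rise."""
    e0, m0, d0 = mean_decomposition(pool_preds, targets)
    e1, m1, d1 = mean_decomposition(pool_preds + [candidate_preds], targets)
    return {
        "accept": e1 <= e0 + tolerance,
        "delta_E": e1 - e0,
        "delta_M": m1 - m0,
        "delta_D": d1 - d0,
    }


targets = [10.0, 20.0]
pool = [[6.0, 24.0], [14.0, 16.0], [13.0, 17.0]]

accurate_but_correlated = [12.0, 18.0]   # best individual error, errs like experts 2 and 3
less_accurate_but_diverse = [5.0, 25.0]  # worst individual error, errs the other way

correlated_report = review_candidate(pool, accurate_but_correlated, targets)
diverse_report = review_candidate(pool, less_accurate_but_diverse, targets)
print("correlated:", correlated_report)
print("diverse:   ", diverse_report)
```

On this toy data the more accurate candidate is rejected (M falls, D falls further, E rises) and the less accurate one is accepted (M rises, D rises further, E falls), which is the two-sentence rule above in executable form.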

The silent collapse

The same algebra predicts a far nastier failure mode, one that doesn't surface in any dashboard you currently have.

Fine-tuning every expert on the same corpus — the obvious move for a production MoE, the default most teams take — drives D toward zero.

The mechanism: every expert is trained against the same loss, on the same data distribution, with gradients pointing at the same minimum. Each expert's parameters drift toward the same neighborhood in weight space. Their predictions on any given input converge. The variance of their outputs, D, tends to shrink as fine-tuning continues.

Plug D = 0 into the theorem. E = M − 0 = M. Your eight-expert MoE has degenerated, exactly and provably, to a single average expert. The ensemble has collapsed. The gating network is doing nothing useful, because every choice it makes returns the same answer. You are paying for eight experts' worth of parameters, and in a densely evaluated mixture eight times the compute, to get one expert's output. And the loss curve looks fine: every individual expert is still accurate, M is in a healthy range, and your aggregated E tracks M because there's no diversity left to subtract.
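
A toy simulation of the collapse. It models convergence as every expert's prediction sliding linearly toward one shared point, which is an assumption made for illustration; real fine-tuning dynamics are more complicated. The names `shrink_toward` and `shared_minimum` and all numbers are hypothetical.

```python
def decompose(predictions, truth):
    """Return (E, M, D) under squared error with a simple average."""
    n = len(predictions)
    aggregate = sum(predictions) / n
    E = (aggregate - truth) ** 2
    M = sum((p - truth) ** 2 for p in predictions) / n
    D = sum((p - aggregate) ** 2 for p in predictions) / n
    return E, M, D


def shrink_toward(predictions, shared_point, t):
    """Move every prediction a fraction t of the way to shared_point."""
    return [(1 - t) * p + t * shared_point for p in predictions]


truth = 10.0
start = [6.0, 14.0, 5.0, 15.0]  # diverse experts whose errors cancel
shared_minimum = 12.0           # where the shared objective pulls everyone

trajectory = []
for step in range(5):
    t = step / 4
    E, M, D = decompose(shrink_toward(start, shared_minimum, t), truth)
    trajectory.append((t, E, M, D))
    print(f"t={t:.2f}  E={E:7.4f}  M={M:7.4f}  D={D:7.4f}")
```

In this run D falls at every step and reaches zero, E rises at every step, and M ends far below where it started. A dashboard that only shows per-expert loss would report the whole trajectory as an improvement.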

This is what economists and social scientists call an information cascade. Abhijit Banerjee published the canonical model in 1992 (A Simple Model of Herd Behavior, Quarterly Journal of Economics 107(3):797–817): rational agents, each observing the choices of those who went before, converge on the same action and set aside their own private information. Each individual update is locally correct. The collective outcome is uninformative.

A fine-tuned MoE is a Banerjee cascade in slow motion. Each expert's gradient step is locally correct — it reduces that expert's individual loss. But the collective evidence the experts are responding to is the same data, the same loss, the same optimizer. The cascade is the inevitable consequence of having every expert see the same upstream.

The 2024 MoE survey caught this empirically. Mixtral, which started from a shared Mistral 7B base and was effectively fine-tuned into an MoE configuration, shows stronger correlations between expert parameters and behaviors than DeepSeekMoE and Grok-1, which were trained from scratch as MoEs (arXiv 2406.18219). The training methodology — shared base vs. independent initialization — directly determines the D term, and D determines how much benefit the MoE actually produces.

This is why the train-then-merge literature works. Enforcing independence at training time, with separately initialized experts on partially disjoint corpora, preserves the variance Page's theorem says you need. The result, in the words of one such paper, is that the strategy "captures the diverse generalization behaviors of individual experts and avoids potentially harmful regularizations introduced by joint training" (arXiv 2301.03962). Page's theorem tells you why the strategy works. It also tells you the cost of skipping it: the D term is what you give up.

What you optimize when you optimize for diversity

The fix is not exotic. The shape of the fix is the shape of the theorem.

Different upstream. Pre-train each expert on a distinct slice of the corpus, or at minimum a different shuffle and a different random seed. The information cascade starts the moment you share an upstream; the cascade is what you are trying to prevent. Train-then-merge is the explicit version of this.
Different objectives. A specialist tuned on a different downstream loss — say, one expert optimized for factuality and another for style — produces different mistakes. Different mistakes are D.
A diversity term in the loss. The decorrelation regularizers proposed in the 2310.09762 paper and elsewhere directly add the D term to the training objective. You do not have to discover the right gradient for diversity; you can write it down. Page's theorem already wrote it.
Acceptance criteria on diversity. Before promoting a new expert into the pool, evaluate it on a diversity benchmark: measure the variance of its predictions against the existing experts on a held-out set. If D drops by more than M does, reject the expert even though its individual accuracy improved. That's the design rule from earlier in this essay, operationalized.
Diversity dashboards. The per-expert metrics every team tracks today report M. Almost no team tracks D. Adding a single panel for variance-of-expert-outputs alongside per-expert loss would catch the silent-collapse failure mode in production. The cost is one line of telemetry; the value is detection of a failure mode that currently produces zero error signal.
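
The third item, a diversity term in the loss, can be written down directly from the identity. The sketch below is one possible form under squared error with gate weights that sum to one; it is not the regularizer from the cited paper, and the names `diversity_penalized_loss` and `diversity_weight` are hypothetical.

```python
def diversity_penalized_loss(expert_outputs, gate_weights, target, diversity_weight):
    """Per-input objective M - diversity_weight * D under squared error.

    diversity_weight = 0 scores each expert on its own error only.
    diversity_weight = 1 equals the mixture's squared error, by E = M - D.
    """
    mixed = sum(g * o for g, o in zip(gate_weights, expert_outputs))
    individual = sum(
        g * (o - target) ** 2 for g, o in zip(gate_weights, expert_outputs)
    )  # M
    diversity = sum(
        g * (o - mixed) ** 2 for g, o in zip(gate_weights, expert_outputs)
    )  # D
    return individual - diversity_weight * diversity


outputs = [8.0, 10.0, 15.0]
gates = [0.5, 0.25, 0.25]
target = 12.0

mixed_output = sum(g * o for g, o in zip(gates, outputs))
loss_individual_only = diversity_penalized_loss(outputs, gates, target, 0.0)
loss_full_diversity = diversity_penalized_loss(outputs, gates, target, 1.0)
print(loss_individual_only, loss_full_diversity, (mixed_output - target) ** 2)
```

The weight makes the trade-off explicit: at 0 the objective ignores disagreement entirely, at 1 it is the mixture's own squared error, and values in between are a tunable middle ground. The per-input M and D computed inside the function are also the two numbers a diversity panel would need to log.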

Each step lowers E by raising D rather than by lowering M. Production MoE programs spend almost all of their attention on lowering M and almost none on raising D, which is why the failure mode that erodes D is silent.

What the political scientist understood that we forgot

Page was not, in 2007, thinking about neural networks. He was thinking about juries, peer-review committees, prediction markets, and the standing question of why a roomful of merely-competent people sometimes outperforms any single genius in the room. His answer in The Difference (Princeton University Press, 2007) was simple: the merely-competent people make different mistakes, and the different mistakes partially cancel in aggregation.

Sixteen years before Page's book, Jacobs, Jordan, Nowlan, and Hinton built the first mixture-of-experts architecture on what was, in retrospect, the same insight: a competition between local experts produces specialists whose errors don't correlate. They did not write the theorem down; they engineered around it.

Page wrote it down. He did not know it was the spec for an ML architecture. The architecture's engineers did not know that a closed-form theorem had already been published, just not in their venues.

This is the cross-domain pattern worth holding onto. The mathematical apparatus you need to design well-behaved AI systems often already exists, in a non-ML field, written by someone who was solving an apparently-unrelated problem. The Diversity Prediction Theorem is one example; the bias-variance decomposition Page's theorem rhymes with is another; the impossibility theorems of social choice are a third; portfolio theory's variance-covariance matrices are a fourth. Each offers a closed-form constraint on how a collection of decision-makers can do better than its members, and each has been rediscovered, partially, by ML practitioners working without the theorem in hand.

The practical insight to take into your next MoE design review is this:

If you can measure how much your experts disagree on the same input, you have the second of the two numbers that, under squared error, determine your system loss exactly. Track both. Optimize both. The gain you can squeeze from disagreement can be larger than the gain you can squeeze from individual accuracy, and unlike accuracy, it has no obvious dashboard, so it disappears first.

Scott Page wrote the spec. Mixture-of-experts engineers have been implementing it for thirty-five years without reading the documentation. Reading the documentation does not change what the code has to do — but it tells you, for the first time, where the bug is.

D has no dashboard because trust doesn't have a dashboard either.

Page's theorem only helps you if both terms — per-expert quality M *and* inter-expert diversity D — are actually instrumented. Most fleets track the first because it falls out of per-agent telemetry by default; the second disappears because no one builds it. The Agent Trust Stack ships both as primitives: per-agent quality signals on one side, cross-agent variance and disagreement on the other, with the identity (E = M − D) computable from the same stream. The "single panel for D" the essay calls for is a query, not a project, once your fleet has the stack underneath it.

