
We Gave 10 Instances the Same Ambiguous Spec and Measured Disagreement

A cheap, reproducible experiment that turns “write concrete specs” from opinion into a measured design rule.

Published May 2026 · 11 min read

Take a phrase you have probably written into a ticket sometime in the last month:

Handle errors appropriately.

Now imagine handing that line, untouched, to ten separate instances of the same coding model — same weights, same temperature, same hardware tier — and asking each to implement the system the phrase describes. What you get back is not ten copies of the same answer. You get something more like a survey.

Some of the instances catch the error, log it, and silently continue. Some catch it, log it, and re-raise. A few return an error code and let the caller decide. One panics and brings the process down on the theory that an unhandled error is a security event. Another writes a circuit breaker with exponential backoff because “appropriately,” in its training data, often meant “resiliently.” One (and you will not love this one) adds a Slack notification that posts to a hardcoded channel name.

You did not get ten interpretations because the model is broken. You got ten interpretations because the phrase has ten interpretations, and the model is doing a perfectly fair job of estimating which one you meant.

This is an essay about turning that observation into a measurement instrument.


The instrument

The core idea is small. Hand the same ambiguous specification to N separate instances of a frontier model. Code each interpretation into discrete categories. Compute the Shannon entropy of the category distribution. That entropy is a number, in bits, that estimates how much interpretive freedom your spec just gave the implementer.

For a fully unambiguous spec, all ten instances agree, the distribution is a single spike, and the entropy is zero bits. For a spec with two equally likely readings, you get something close to one bit. For a spec where all ten instances pick something different, you get the theoretical ceiling of about 3.32 bits across ten categories. Everything else lives on the gradient in between.
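A minimal sketch of that calculation, assuming the interpretations have already been hand-coded into category labels. The function name and the label strings are illustrative, not part of any published protocol.

```python
import math
from collections import Counter

def interpretation_entropy(labels):
    """Shannon entropy, in bits, of the category labels assigned to N instances."""
    counts = Counter(labels)
    total = sum(counts.values())
    # H = -sum(p * log2(p)) over the categories that were actually observed
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

# The three reference points from the text, using made-up category labels
unanimous = ["log_and_reraise"] * 10
even_split = ["log_and_reraise"] * 5 + ["log_and_continue"] * 5
all_different = [f"reading_{i}" for i in range(10)]

print(round(interpretation_entropy(unanimous), 2))      # 0.0
print(round(interpretation_entropy(even_split), 2))     # 1.0
print(round(interpretation_entropy(all_different), 2))  # 3.32
```

The ceiling is log2(N) for N instances, so a run of ten can never score above about 3.32 bits no matter how vague the spec is.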

The number is useful because it is comparable. “Handle errors appropriately” might score 2.1 bits. “Catch and log all uncaught exceptions to standard error before re-raising” might score 0.3 bits. The difference is a measurement of how much your wording outsourced to the reader.

It is also useful because it does not require the spec author to know what the ambiguity is. The author does not have to anticipate the ten readings. The instances supply them. Daniel Berry’s long-running research program on requirements ambiguity at the University of Waterloo has made the case for decades that the most dangerous ambiguities are the ones nobody notices in review — the cases where each reader privately settles on an interpretation and assumes the others did the same. Berry calls these “nocent” ambiguities. An entropy measurement is a way to surface them mechanically, by watching ten parallel readers fail to agree.

That part is mostly a re-application of standard inter-rater reliability work to a new kind of rater. The interesting question, the one worth running an actual experiment to answer, is what the dial responds to.


Why temperature=0 isn’t enough

Before we get to the experiment, the noise floor. Run the same prompt through the same model at temperature 0 and, by the usual reading of the LLM literature, the output should be deterministic. It is not. GPU floating-point ordering, batching effects on the softmax computation, and API-level load balancing across replicas all introduce variation that survives even a fully greedy decode. The “Temperature=0 is a lie” essays that circulated in 2024 and 2025 (Vincent Schmalbach on Medium, the Thinking Machines Lab post on defeating nondeterminism in LLM inference, the Keywords AI blog) are not rhetorical. They are describing how the silicon actually works.

For our purposes this means two things. First, ten instances at temperature 0 can still disagree on identical input, even when there is no ambiguity to disagree about. The measurement instrument has a noise floor. Second, that noise floor is something you can characterize and subtract. Run the protocol on a control set of fully unambiguous specs, such as “return the sum of two integers passed as arguments named a and b,” and the entropy you measure is the floor. Anything above it on a real spec is signal.

This is the unglamorous part of the methodology but it is the part that separates the measurement from the vibes. Without a noise-floor control, an entropy of 0.4 bits could be hardware. With one, you know whether the spec did that or the silicon did.
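One way to sketch the control step, assuming the floor is taken as the mean entropy across the control specs and treated as a simple baseline to subtract. That averaging choice, and the function names, are assumptions made for illustration; a stricter protocol might use a high percentile of the control distribution instead.

```python
import math
from collections import Counter

def entropy_bits(labels):
    """Shannon entropy, in bits, of one spec's coded interpretations."""
    counts = Counter(labels)
    total = sum(counts.values())
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

def noise_floor(control_runs):
    """Mean entropy over control specs that have only one sensible reading."""
    scores = [entropy_bits(labels) for labels in control_runs]
    return sum(scores) / len(scores)

def excess_entropy(spec_labels, floor):
    """Entropy above the floor, clamped at zero because the floor is only an average."""
    return max(0.0, entropy_bits(spec_labels) - floor)

# Toy inputs to exercise the functions, not measured results
control_runs = [
    ["sum"] * 10,                          # control spec 1: unanimous
    ["sum"] * 9 + ["sum_with_typecheck"],  # control spec 2: one stray instance
]
floor = noise_floor(control_runs)
spec_labels = ["log_and_reraise"] * 7 + ["log_and_continue"] * 3
print(f"floor={floor:.2f} bits, excess={excess_entropy(spec_labels, floor):.2f} bits")
```

The control set should be coded with the same category granularity as the real specs. A finer coding scheme on the controls would inflate the floor and hide real signal.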


The headline hypothesis

The variable worth manipulating is the level of abstraction in the spec’s nouns. The cognitive-science literature has spent fifty years on the concreteness effect: Paivio’s dual-coding theory holds that concrete words activate both verbal and visual-sensory representations while abstract words rely primarily on the verbal channel, and the empirical consequence is faster lexical access for concrete words by something like 50-100 milliseconds. Contextual availability theory adds the complementary claim that abstract concepts are linked to a wider range of contexts and require more contextual scaffolding to retrieve a specific reading.

If those claims port from human readers to language models — and there are reasons to think they partly do, since the models are trained on the corpora that produced the effect — then specs written in abstract nouns should produce systematically more interpretive disagreement than specs written in concrete nouns, controlling for length. A reasonable prediction, grounded in the human-side effect size, is somewhere in the range of half a bit to a bit of additional entropy on the abstract side. Not dramatic, but real, and detectable across an N of 50 specs with 10 instances each.

That is the experiment in brief: write 50 specs as 25 matched pairs, one version of each with abstract nouns and one with concrete nouns, control for length and structure, run each through 10 instances at temperature 0, code the interpretations, and compute entropy. Compare the two distributions. The literature is loud enough about the underlying effect that you should be very surprised if you do not see a difference.
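The comparison step could be sketched as a paired sign-flip permutation test on the per-pair entropy differences. The choice of test is an assumption made here for illustration, since the text does not name one, and the function name is hypothetical. The inputs are two equal-length lists of entropies, one entry per matched pair.

```python
import random

def paired_sign_flip_test(abstract_bits, concrete_bits, n_permutations=10_000, seed=0):
    """One-sided test of whether abstract specs score higher entropy than their concrete twins.

    Returns (mean difference in bits, permutation p-value).
    """
    diffs = [a - c for a, c in zip(abstract_bits, concrete_bits)]
    observed = sum(diffs) / len(diffs)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the abstract/concrete labels within a pair
        # are exchangeable, so each difference is equally likely to flip sign.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if sum(flipped) / len(flipped) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly above zero
    p_value = (hits + 1) / (n_permutations + 1)
    return observed, p_value
```

A permutation test makes no distributional assumptions, which seems prudent here: entropies computed from ten instances take only a handful of discrete values and are bounded between 0 and about 3.32 bits.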

But the more interesting result is what the literature predicts you will not see.


The specificity wrinkle

The concreteness/abstractness axis is what most of the prior work measures. A line of 2024-2025 research from Frontiers in Psychology and Wiley’s Mind & Language makes a quieter claim: concreteness and specificity are different variables, and specificity may be the one that actually drives interpretation.

The example that makes it click is to compare four phrases. “Tool” is concrete but vague. “Hammer” is concrete and specific. “Justice” is abstract and vague. “Retributive justice” is abstract but specific.

If concreteness alone drove interpretation, “tool” and “hammer” should both produce low entropy in our protocol. They do not. “Tool” is the kind of word that produces disagreement, because a tool can be a wrench, a script, a Python library, or a generic noun standing in for any artifact at all. “Hammer,” in contrast, is concretely and specifically a hammer. “Retributive justice,” though abstract, has a circumscribed meaning that has been argued over for two centuries of legal philosophy. It is, paradoxically, less ambiguous than the concrete word “tool.”

This means the proper experiment has two variables, not one: concreteness and specificity. The prediction that flips the conventional reading is that specificity will dominate. The biggest entropy reductions will come from specifying which of the candidate referents you mean, not from making the noun more sensory. “Append the formatted error string to /var/log/app.err and continue execution” is concrete and specific. “Handle errors appropriately” is mostly a failure of specificity, not of concreteness. The fix is not to make the spec touchable. The fix is to make it narrower.

The cognitive-science finding, in other words, predicts a result that is more useful to spec authors than the simpler concreteness hypothesis would have been. You do not need to make your specs sensory. You need to make them narrow. The advice is older (Joel Spolsky was writing it more than twenty years ago), but the cognitive instrumentation now agrees with the advice in a way that is measurable.


The dangerous case

The most important result the experiment can produce is not the headline effect. It is the shape of the disagreement at the 7-to-3 boundary.

A spec where all ten instances disagree, with entropy near the 3.32-bit ceiling, is a spec that will fail in review. The author will see ten wildly different interpretations and rewrite. A spec where all ten instances agree, with entropy near 0 bits, is a spec that will succeed. The dangerous case is the one where seven instances pick interpretation A and three pick interpretation B. Entropy lands at about 0.88 bits. The spec looks clear. The author reads the majority interpretation, sees their own intent reflected back, and ships. If future implementers split the same way, thirty percent of the implementations eventually written from that spec will be wrong, and the wrongness will be invisible until somebody downstream notices the system doing the wrong thing.
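The arithmetic behind those figures can be checked directly from the cluster sizes. This is a small worked sketch; the helper name is illustrative.

```python
import math

def entropy_from_counts(counts):
    """Shannon entropy, in bits, given the size of each interpretation cluster."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

# Cluster sizes for ten instances, from unanimous to a three-way split
for split in [(10,), (9, 1), (7, 3), (5, 5), (5, 3, 2)]:
    majority_share = max(split) / sum(split)
    print(split, f"majority={majority_share:.0%}", f"entropy={entropy_from_counts(split):.2f}")

# Entropies, in order: 0.00, 0.47, 0.88, 1.00, 1.49
```

A 7-to-3 split sits closer to a coin flip (1.00 bit) than to unanimity, even though a 70 percent majority reads as consensus when you only skim the winning answer.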

This is the nocent-ambiguity case in Berry’s framing. It is also the case where the LLM instrument earns its keep. A human reviewer will not catch a 7-to-3 ambiguity, because the reviewer privately settles on interpretation A, like the seven majority instances, and the disagreement never surfaces. The instrument catches it because the three dissenters are not subject to the same conformity pressure: they are independent samples of the conditional distribution over readings, and they report their disagreement honestly. If a spec produces a 7-to-3 split and you ship it without addressing the dissenters, you are gambling that every future implementer happens to land in the agreeing seventy percent. The odds get worse if the spec lives long enough to be read by people who were not in the room when it was written.

The practical move is to set an entropy threshold — call it 0.5 bits — above which specs go back for revision before any implementation work begins. The number is arbitrary. The discipline is not.
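As a sketch, the gate could be a single function that takes the coded interpretations and reports whether the spec goes back for revision. The function name and the shape of the returned dictionary are assumptions; the 0.5-bit default is the arbitrary threshold named above.

```python
import math
from collections import Counter

def review_gate(labels, threshold_bits=0.5):
    """Flag a spec for revision when its interpretation entropy exceeds the threshold."""
    counts = Counter(labels)
    total = sum(counts.values())
    bits = sum(-(c / total) * math.log2(c / total) for c in counts.values())
    return {
        "entropy_bits": round(bits, 2),
        "needs_revision": bits > threshold_bits,
        # Largest cluster first, so the dissenting readings are easy to spot
        "readings": counts.most_common(),
    }

# Toy labels for a 7-to-3 split, not a measured result
print(review_gate(["log_and_reraise"] * 7 + ["log_and_continue"] * 3))
```

Returning the readings alongside the verdict matters more than the verdict itself: the revision is easier to write when the author can see which minority interpretation the wording allowed.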


Where this argument is weakest

Three concessions.

First, LLM instances are not independent in the way separate human readers are. They share weights, training data, and the same broad inductive biases. If all ten instances learned the same default reading from the same corpus, they will agree on that default even when humans would disagree. The instrument may therefore underestimate ambiguity systematically — by producing high agreement on specs that humans would actually fight about — rather than overestimate it. That is the opposite of the failure mode I have been describing. The fix is to test the instrument’s calibration against a human-coded gold standard on a subset of specs and report the agreement rate. If LLM entropy and human entropy diverge sharply, the instrument is measuring something narrower than human interpretation.

Second, the choice of model matters more than the methodology pretends. Claude 4.6, GPT-5, Gemini 3, and the open-weight Llama line will not produce identical entropy distributions on the same prompts. The 2026 arXiv paper Same Prompt, Different Outcomes (arXiv:2602.14349) made this point for data-analysis tasks; it will be at least as true for interpretation tasks. An entropy of 0.8 bits on Claude is not the same finding as 0.8 bits on GPT. If the goal is to compare specs against each other for a given team’s deployed model, the methodology is fine. If the goal is to produce an absolute ambiguity score that travels across vendors, the methodology is not there yet, and may never be.

Third, the relevant unit may not be the noun. Pragmatic ambiguity (“appropriately,” “robustly,” “as needed”) sits in adverbs and qualifiers, not the nouns the experiment manipulates. A spec like “back up the database appropriately” is concrete and specific in its nouns and pragmatically catastrophic in its adverb. A protocol that varies only nouns will not catch that class. The experiment’s value would be sharpened by a second arm that holds nouns fixed and independently varies the pragmatic qualifiers, the “appropriately” axis.


What you can do with this on Monday

The whole methodology fits in a small loop you can write in an afternoon. Take any natural-language spec or ticket. Drop it into ten chat sessions with your team’s deployed model, fresh context each time, temperature 0. Ask each to produce a short implementation sketch: a function signature plus three to five lines of pseudocode is enough. Glance through the ten sketches. If they cluster on one approach, the spec is doing its job. If you see a 7-to-3 split, or three obvious clusters at 5-3-2, you have the dangerous case hiding under the surface, and the spec needs to be rewritten before any human writes any code against it.
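A sketch of that loop, with the model call and the clustering step left as functions you supply. Both `ask_model` and `categorize` are hypothetical placeholders: the first is assumed to open a fresh session at temperature 0 with whatever client your team uses, and the second stands in for your own judgment of which cluster a sketch belongs to. The stub versions below exist only so the example runs offline.

```python
from collections import Counter
from itertools import cycle

PROMPT_TEMPLATE = (
    "Produce a short implementation sketch for the spec below: "
    "a function signature plus three to five lines of pseudocode.\n\n"
    "Spec: {spec}"
)

def parallel_reads(spec, ask_model, categorize, n=10):
    """Send the same spec to n fresh sessions and count the interpretation clusters."""
    prompt = PROMPT_TEMPLATE.format(spec=spec)
    sketches = [ask_model(prompt) for _ in range(n)]
    return Counter(categorize(sketch) for sketch in sketches)

# Offline stand-ins: a fake model that cycles through canned answers,
# and a categorizer that treats each distinct answer as its own cluster.
_canned = cycle(["log and re-raise", "log and continue", "log and re-raise"])

def fake_model(prompt):
    return next(_canned)

def identity_categorizer(sketch):
    return sketch

clusters = parallel_reads("Handle errors appropriately.", fake_model, identity_categorizer)
print(clusters.most_common())
```

Passing the model call in as a function keeps the loop independent of any vendor SDK, which also makes it easy to rerun the same spec against a second model.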

You do not need entropy in bits to get the value. You need ten parallel reads where you previously had one. The cognitive-science literature is suggesting, in a vocabulary that is decades old, that the variance across independent readers of a piece of language is a measurable property of that piece of language. We can now measure it cheaply, and we can measure it on the specifications that will eventually decide what software actually does. That is a different kind of test than the tests we currently run.

The deeper move is to start treating specifications as artifacts that have a measurable property called interpretation bandwidth, and to optimize for low bandwidth the way we already optimize for, say, low cyclomatic complexity. Cyclomatic complexity is a number we accept as a code-health signal even though it cannot tell us whether the code is correct. Interpretation entropy is a number we can accept as a spec-health signal even though it cannot tell us whether the spec is right. Both are operating in the same regime — quantifying a property that used to be a matter of taste — and both let teams set thresholds, alert on regressions, and have an honest conversation about a thing that previously could only be argued.

The phrase “handle errors appropriately” will likely, on most teams’ models, score somewhere north of 1.5 bits. That is a Monday morning measurement. Once you have the number, the only remaining question is whether you ship the spec or rewrite it. The number, for once, is the one giving you the answer.

Which interpretation did the agent actually pick?

The entropy measurement reveals that a spec has multiple readings. The next question, when an agent ships the implementation, is which reading it chose — and whether the choice is recoverable when the system does the wrong thing six months later. Chain of Consciousness anchors every agent action to a verifiable external record, so “the agent interpreted ‘appropriately’ as ‘swallow and continue’” is not a reconstruction in a postmortem. It’s a query against the chain.

pip install chain-of-consciousness · npm install chain-of-consciousness