Softmax Is the Boltzmann Distribution

Your model's “temperature” is 1870s thermodynamics. The equation doesn't know which department it's filed under.

Published June 2026 · 12 min read

On October 8, 2024, the Royal Swedish Academy of Sciences awarded the Nobel Prize in Physics to John Hopfield and Geoffrey Hinton, and a meaningful slice of the physics community was annoyed. The citation read “for foundational discoveries and inventions that enable machine learning with artificial neural networks,” which is to say the most prestigious prize in physics had gone, in part, to Geoffrey Hinton, a man the world knows as the “godfather of AI,” a cognitive scientist and computer scientist who has spent his career building neural networks, not particle detectors. The grumbling was public and pointed: this work, critics said, has “only a lateral relationship to physics”; it would be better filed under computer science; a Nobel in Physics for machine learning blurs a boundary that ought to stay sharp. There was a fair version of the complaint, too, that earlier neural-network pioneers like Alexey Ivakhnenko and Shun-ichi Amari had gone unrecognized for decades.

But the disciplinary objection (this isn't really physics) is the most interesting thing about the prize, because the physicists making it were, without quite intending to, confirming the deepest fact about how modern AI works. The reason a machine-learning prize could even plausibly be a physics prize is not committee politics. It's that the equation sitting at the output of essentially every language model and classifier on Earth, the one your model runs on every single forward pass, is not like a physics equation. It is one, written down in the 1870s to explain why gases are warm.

The same equation, twice

Here is the function in question. A language model, having processed your prompt, produces a vector of raw scores (logits), one per possible next token. To turn those scores into a probability distribution, it applies the softmax: the probability of token i is exp(z_i) / Σ_j exp(z_j), the exponential of its score divided by the sum of the exponentials of all of them. Add a temperature knob and it becomes exp(z_i/T) / Σ_j exp(z_j/T). This is the last step before sampling, and most practitioners think of it as a normalization detail.
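To make the formula concrete, here is a minimal sketch of temperature-scaled softmax in plain Python. The function name, the example logits, and the choice to reject T ≤ 0 are illustrative assumptions, not any particular library's API.

```python
import math

def softmax(logits, temperature=1.0):
    """Return exp(z_i / T) / sum_j exp(z_j / T) for a list of logits."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; T -> 0 is the argmax limit")
    scaled = [z / temperature for z in logits]
    # Subtracting the max leaves the probabilities unchanged but avoids overflow.
    shift = max(scaled)
    weights = [math.exp(s - shift) for s in scaled]
    total = sum(weights)  # the denominator, up to the constant factor exp(shift)
    return [w / total for w in weights]

logits = [2.0, 1.0, 0.1]  # illustrative values, not taken from a real model
probs = softmax(logits)
hot = softmax(logits, temperature=10.0)
```

Comparing `probs` with `hot` shows the knob at work: the same three scores give a flatter distribution at the higher temperature.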

Now put it next to the Boltzmann distribution, the centerpiece of statistical mechanics, which gives the probability that a system in thermal equilibrium occupies a state of energy E_i: P_i = (1/Z) · exp(−E_i / k_B T), where Z = Σ_j exp(−E_j / k_B T) is the sum over all states. Line the two up and translate. Your logits are negative energies: a high logit is a low energy, the state the system “wants” to be in. The denominator you dismiss as normalization is the partition function Z, the sum over states at the heart of Josiah Willard Gibbs's statistical mechanics. And your “temperature” is Boltzmann's T, with the physical constant k_B absorbed because a machine-learning temperature is a dimensionless number rather than something measured in kelvin. Make the substitution E_i = −z_i, set k_B = 1, and the two equations are not similar. They are character-for-character the same.
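The substitution is easy to check numerically. The sketch below writes the two formulas separately, exactly as stated above, and compares them; the function names and the sample logits are assumptions made for illustration, and k_B is set to 1 as in the text.

```python
import math

def softmax(logits, temperature):
    """Machine-learning form: exp(z_i / T) / sum_j exp(z_j / T)."""
    weights = [math.exp(z / temperature) for z in logits]
    denominator = sum(weights)
    return [w / denominator for w in weights]

def boltzmann(energies, temperature, k_b=1.0):
    """Physics form: exp(-E_i / (k_B T)) / Z."""
    weights = [math.exp(-e / (k_b * temperature)) for e in energies]
    partition = sum(weights)  # Z
    return [w / partition for w in weights]

logits = [2.0, 1.0, 0.1]           # illustrative values
energies = [-z for z in logits]    # the substitution E_i = -z_i

matches = []
for t in (0.5, 1.0, 2.0):
    p_ml = softmax(logits, t)
    p_phys = boltzmann(energies, t)
    matches.append(all(math.isclose(a, b) for a, b in zip(p_ml, p_phys)))
```

Every entry of `matches` is `True`: the two functions are the same computation under a change of variable names.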

A precise caveat, because precision is the whole point: this is an identity of form, not a claim that your GPU is literally hot or your logits are literally in joules. The same probability distribution governs both systems; the physical interpretation (energy, heat, equilibrium) is what you get to borrow, not a literal property of your hardware. And the history is partly convergent rather than copied: softmax was also reached independently through logistic regression and through Luce's choice axiom, so its lineage isn't purely thermodynamic. But the equation you end up holding, by whichever road you arrive, is Boltzmann's. The disciplines fought over the 2024 Nobel because the object really does belong to both of them.

Why an identity beats an analogy

Most cross-domain connections are analogies (“X is like Y”) and analogies are fragile things. You have to check them part by part: this feature maps, that one doesn't, here the comparison bends and breaks. An identity is a different animal entirely. When two things are the same object, the full toolkit of one transfers to the other for free, with no checking required, because there is nothing to check; they're not two systems that resemble each other, they're one system described in two vocabularies. So “softmax is the Boltzmann distribution” is not a clever observation to nod at and move past. It is a key to a hundred and fifty years of statistical mechanics that you can turn in the lock of your own sampler, verbatim. The rest of this essay is just walking through the toolbox that key opens.

Temperature is a noise budget, not a creativity dial

Start with the one piece everyone already half-knows, because the physics fixes the mental model. People call temperature a “creativity dial,” which gets the direction right and the mechanism wrong, and the wrong mechanism is what gets you in trouble. In the physics, temperature is a thermal-noise budget: how much energy the system has available to jiggle itself out of whatever state it has settled into. Turn it up and the distribution flattens toward uniform: the system has enough thermal energy to visit high-energy, low-probability states, which in sampling terms is exploration. Turn it down and the distribution sharpens toward the single lowest-energy mode: exploitation.
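One way to watch the flattening and sharpening is to track the Shannon entropy of the distribution as temperature changes. This is a small sketch with made-up logits; the temperature grid is an arbitrary choice.

```python
import math

def softmax(logits, temperature):
    scaled = [z / temperature for z in logits]
    shift = max(scaled)  # numerical stability only
    weights = [math.exp(s - shift) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

def entropy(probs):
    """Shannon entropy in nats; terms with p == 0 contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

logits = [3.0, 1.0, 0.5, -1.0]               # illustrative values
temperatures = [0.05, 0.5, 1.0, 2.0, 50.0]   # cold to hot
entropies = [entropy(softmax(logits, t)) for t in temperatures]
uniform_entropy = math.log(len(logits))      # the ceiling, reached as T grows
```

For these logits the entropy climbs from essentially zero at the cold end toward `uniform_entropy` (ln 4) at the hot end, which is the noise budget expressed as a single number.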

And the endpoint is where the reframing earns its keep. In the limit T → 0, temperature sampling becomes argmax: the model always picks its single highest-scoring token. (You cannot literally divide by zero, which is why implementations special-case it as greedy decoding.) The folk model says this is “minimum creativity,” as if you'd slid to one end of a smooth continuum. The physics says something sharper and more useful: at zero temperature the system freezes into its ground state, the one lowest-energy configuration, the way a liquid with all its thermal jiggling removed snaps into a rigid crystal. That word, freezes, is not poetic. It predicts a specific failure, and that failure is the most under-appreciated thing the identity hands you.

Your sampler has phase transitions, and they've been measured

Here is the payoff the “creativity dial” framing literally cannot give you, because dials don't do this and phases do. In physics, a smooth change in temperature can produce a sudden, qualitative change in a system's behavior. Water does not become gradually more solid as you chill it; it is liquid, liquid, liquid, and then at zero degrees Celsius it is ice. The transition is a cliff, not a slope, a discontinuity in the system's character as a control parameter crosses a threshold.

If your sampler is running the Boltzmann distribution, it should have these too. It does. In 2024, researchers published a paper titled, with no hedging, “Critical Phase Transition in a Large Language Model” (arXiv 2406.05335). They swept the sampling temperature of GPT-2 and reported what they described as the first convincing numerical evidence that a practical language model exhibits a phase transition: abrupt, qualitative shifts in output behavior at particular temperature thresholds, “akin to phase changes in physical systems.” And every practitioner has felt this without having the word for it. Push temperature up and the output does not get smoothly, gradually more creative; at some point it falls off a cliff into word-salad incoherence. Pull temperature down and the output does not get smoothly, gradually more focused; at some point it freezes into repetition, looping, self-reinforcing patterns. That low-temperature collapse into a single rigid groove is mode collapse, and it is precisely the ground-state freezing the temperature reframing told you to expect.

One honest qualification, because the physics demands it: a strictly non-analytic phase transition, in the mathematical sense, exists only in the limit of an infinite system. A finite model, even a large one, shows a very sharp crossover, transition-like behavior, which is exactly what that paper carefully measures and claims, no more. But the engineering lesson survives the asterisk completely. You are not turning a dial along a smooth continuum where 0.71 behaves like a slightly bolder 0.70. You may be standing next to a boundary, and the comfortable folk wisdom of “just set temperature to about 0.7” is a collective, superstitious memory of roughly where one of those boundaries sits for one class of model. Knowing where your model's transitions actually are (sweeping the temperature and watching for the cliff edges) is sampler engineering. Copying 0.7 from a tutorial is vibes.
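A temperature sweep can be sketched as a small harness. Everything here is an assumption made for illustration: `generate` stands for any callable you supply that takes a temperature and returns a list of tokens, and the distinct-bigram ratio is only one possible metric. It catches the low-temperature slide into repetition; the high-temperature side would need a coherence measure of your own choosing.

```python
def distinct_ratio(tokens, n=2):
    """Fraction of n-grams that are unique: near 0 means looping, 1 means no repeats."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

def sweep(generate, temperatures, metric=distinct_ratio, samples=5):
    """Average `metric` over several generations at each temperature."""
    curve = []
    for t in temperatures:
        scores = [metric(generate(t)) for _ in range(samples)]
        curve.append((t, sum(scores) / len(scores)))
    return curve

def steepest_step(curve):
    """Return the adjacent temperature pair where the metric changes fastest."""
    left, right = max(
        zip(curve, curve[1:]),
        key=lambda pair: abs(pair[1][1] - pair[0][1]) / (pair[1][0] - pair[0][0]),
    )
    return left[0], right[0]

# Hypothetical stand-in for a real model: loops below T = 0.5, never repeats above it.
def fake_generate(temperature):
    if temperature < 0.5:
        return ["a", "b"] * 10
    return [str(i) for i in range(20)]

curve = sweep(fake_generate, [0.1, 0.3, 0.5, 0.7])
edge = steepest_step(curve)
```

With a real model you would pass your own sampling function in place of `fake_generate`, use a finer temperature grid near `edge`, and repeat the sweep per prompt family, since there is no reason to expect the crossover to sit in the same place for every task.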

The number you throw away contains the whole system

Now the piece almost nobody uses, and it's sitting in your code already. The softmax denominator (the sum of the exponentiated logits, or in log space the log-sum-exp of them) is treated as pure bookkeeping, the quantity you divide by so the probabilities add to one. In statistical mechanics that quantity is the partition function Z, and it has a property that borders on unreasonable: once you know Z as a function of temperature, you know every equilibrium property of the system. Its average energy, its entropy, its fluctuations all fall out of ln Z by differentiation with respect to temperature. The log of it, scaled by temperature, is the free energy, F = −T ln Z, which is the quantity the system actually minimizes; the probabilities are downstream of that.
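The claim that everything falls out of Z can be checked on a toy system. The sketch below assumes k_B = 1 and works with the inverse temperature β = 1/T, using the standard relations ⟨E⟩ = −∂ln Z/∂β, Var(E) = ∂²ln Z/∂β², and S = ln Z + β⟨E⟩. The energies are illustrative, and the derivatives are taken by finite differences rather than analytically.

```python
import math

def log_z(energies, beta):
    """ln Z = log-sum-exp of (-beta * E_j), computed stably."""
    terms = [-beta * e for e in energies]
    shift = max(terms)
    return shift + math.log(sum(math.exp(x - shift) for x in terms))

def boltzmann(energies, beta):
    lz = log_z(energies, beta)
    return [math.exp(-beta * e - lz) for e in energies]

energies = [-3.0, -1.0, -0.5, 1.0]   # E_i = -z_i for illustrative logits
temperature = 0.8
beta, h = 1.0 / temperature, 1e-4    # h is the finite-difference step

# Computed the long way, from the probabilities themselves.
probs = boltzmann(energies, beta)
mean_e = sum(p * e for p, e in zip(probs, energies))
var_e = sum(p * (e - mean_e) ** 2 for p, e in zip(probs, energies))
entropy = -sum(p * math.log(p) for p in probs)

# Computed from ln Z alone, by central differences in beta.
lz_minus, lz_mid, lz_plus = (log_z(energies, b) for b in (beta - h, beta, beta + h))
mean_e_from_z = -(lz_plus - lz_minus) / (2 * h)
var_e_from_z = (lz_plus - 2 * lz_mid + lz_minus) / h ** 2
entropy_from_z = lz_mid + beta * mean_e_from_z
free_energy = -temperature * lz_mid  # F = -T ln Z
```

The two routes agree to within finite-difference error, and `free_energy` equals `mean_e - temperature * entropy`, the familiar F = ⟨E⟩ − TS.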

For you, the practitioner, this cashes out as a free and principled signal you are currently computing and then immediately discarding. The log-sum-exp of your logits is a single scalar that summarizes the scale of the entire next-token distribution. One caveat matters: softmax is unchanged when you add a constant to every logit, while the log-sum-exp shifts by exactly that constant, so read it against a reference rather than raw. The simplest reference is the top logit: the top logit minus the log-sum-exp is exactly the log-probability of the most likely token. When the model is certain (one token towering over the rest) that gap sits near zero; when the model is guessing (a flat smear across many tokens) it is strongly negative. You don't need a separate uncertainty head or an ensemble to get a confidence gauge; the log-partition gives you one, derived from first principles, produced on every forward pass, and thrown in the trash the instant you normalize. Physics spent a century learning that this number is the most informative object in the system. Your inference loop computes it and looks away.
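Here is a minimal sketch of keeping that scalar instead of discarding it. The dictionary keys and the two example logit vectors are hypothetical. Because the raw log-sum-exp moves when every logit is shifted by a constant, the sketch also reports two shift-invariant readings derived from it: the log-probability of the top token and the entropy.

```python
import math

def logsumexp(logits):
    """ln of the softmax denominator, computed stably."""
    shift = max(logits)
    return shift + math.log(sum(math.exp(z - shift) for z in logits))

def confidence_signals(logits):
    lse = logsumexp(logits)
    log_probs = [z - lse for z in logits]  # log-softmax
    return {
        "log_partition": lse,                 # ln Z at T = 1
        "top_log_prob": max(logits) - lse,    # 0 when certain, very negative when guessing
        "entropy": -sum(math.exp(lp) * lp for lp in log_probs),
    }

peaked = confidence_signals([9.0, 1.0, 0.5, 0.0])   # one token towering over the rest
flat = confidence_signals([1.1, 1.0, 0.9, 1.0])     # a smear across all four
shifted = confidence_signals([109.0, 101.0, 100.5, 100.0])  # peaked logits plus 100
```

`peaked` and `shifted` describe the same distribution, so their `top_log_prob` and `entropy` match while `log_partition` differs by 100. That is the argument for logging the derived readings alongside the raw value.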

Softmax is the honest distribution, and annealing is a blacksmith's trick

Two more tools, quickly, because they change why you trust the thing. First: the softmax is not an arbitrary squashing function someone happened to pick. The physicist E. T. Jaynes showed in 1957 that the Boltzmann form is the maximum-entropy distribution consistent with a known average energy, the distribution that assumes the absolute least beyond what your constraints actually tell you. Translated, temperature sampling is not a hack stapled onto a classifier; it is principled inference, the most honest, least-committal distribution available given the scores the network produced. (Jaynes's broader subjectivist reading of probability is still genuinely debated, so take this as illumination rather than settled doctrine, but as an answer to “why this function and not some other squash,” it's a good one.)
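The maximum-entropy property can be spot-checked numerically. The sketch below takes a three-state Boltzmann distribution and nudges it along a direction chosen to preserve both normalization and average energy; the energies and step sizes are arbitrary illustrative choices. Every such rival should have lower entropy.

```python
import math

def boltzmann(energies, temperature):
    weights = [math.exp(-e / temperature) for e in energies]
    z = sum(weights)
    return [w / z for w in weights]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs)

def mean_energy(probs, energies):
    return sum(p * e for p, e in zip(probs, energies))

energies = [0.0, 1.0, 2.0]
p = boltzmann(energies, temperature=1.0)

# This direction sums to zero and is orthogonal to the energies, so moving
# along it changes neither the total probability nor the average energy.
direction = [1.0, -2.0, 1.0]

def perturbed(eps):
    return [pi + eps * d for pi, d in zip(p, direction)]

rivals = [perturbed(eps) for eps in (-0.02, -0.01, 0.01, 0.02)]
base_entropy = entropy(p)
rival_entropies = [entropy(q) for q in rivals]
```

Each rival satisfies the same constraints as `p` yet commits to more structure than the constraints justify, and its entropy is lower. A handful of perturbations is an illustration of Jaynes's result, not a proof of it.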

Second: the temperature schedule isn't new either. In 1983, Scott Kirkpatrick, C. D. Gelatt, and M. P. Vecchi published “Optimization by Simulated Annealing,” taking a 1950s physics simulation method and adding one move: start the system hot, so it has enough thermal noise to leap out of local minima and roam, then cool it slowly so it settles toward the global optimum, exactly the way you anneal metal by heating and slowly cooling it to drive out defects. Stuart and Donald Geman showed the following year that a sufficiently slow (logarithmic) cooling schedule converges, in probability, to the global optimum. So when you start generation hot to explore and cool it to commit, you are running, knob by deliberate knob, a procedure named for what a blacksmith does to steel.
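As a sketch of the procedure, here are two cooling schedules and a bare-bones Metropolis-style annealer run on a toy one-dimensional landscape. The landscape, the schedule parameters, and all function names are assumptions made for illustration. The logarithmic schedule is written in a shifted form, T_k = c / ln(k + 2), so that the first step is finite; the geometric schedule is the faster practical shortcut and carries no convergence guarantee.

```python
import math
import random

def geometric_schedule(t_start, t_end, steps):
    """Multiplicative cooling from t_start down to t_end."""
    if steps < 2:
        return [t_start]
    ratio = (t_end / t_start) ** (1.0 / (steps - 1))
    return [t_start * ratio ** k for k in range(steps)]

def logarithmic_schedule(c, steps):
    """Slow cooling, T_k = c / ln(k + 2), the form tied to the convergence result."""
    return [c / math.log(k + 2) for k in range(steps)]

def anneal(energy, neighbor, state, schedule, rng):
    """Metropolis acceptance at each temperature; returns the best state seen."""
    best = state
    for t in schedule:
        candidate = neighbor(state, rng)
        delta = energy(candidate) - energy(state)
        # Always accept downhill moves; accept uphill ones with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            state = candidate
        if energy(state) < energy(best):
            best = state
    return best

# Toy landscape: a local minimum at index 1, the global minimum at index 8.
landscape = [4, 2, 3, 5, 6, 5, 3, 1, 0, 1, 3, 5]

def energy(i):
    return landscape[i]

def neighbor(i, rng):
    return min(len(landscape) - 1, max(0, i + rng.choice((-1, 1))))

schedule = geometric_schedule(t_start=5.0, t_end=0.05, steps=500)
best = anneal(energy, neighbor, state=1, schedule=schedule, rng=random.Random(0))
```

The walker starts in the local minimum. While the temperature is high, uphill moves are accepted often enough to give it a chance of crossing the barrier; by the end of the schedule almost none are. For text generation the analogue would be to feed a schedule like these into the per-step sampling temperature, which is a design choice to test rather than a guarantee.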

Pick up the toolbox

The shift this asks of you is small and the upside is free, because you are already doing the physics whether you name it or not. Stop treating temperature as a creativity slider on a smooth line, and start treating it the way the person whose equation you're computing would. Concretely, three things:

Map your model's phase boundaries instead of inheriting a magic number: sweep the temperature and find the cliffs, the onset of incoherence going up and the onset of repetition going down, because those crossovers, not the round numbers in the docs, are the edges of your real operating envelope.
Read the partition function you already compute: the log-sum-exp of the logits is a first-principles confidence signal; log it, watch it, alert on it.
When you need to escape a bad mode, anneal: start hot, cool slowly, because that is the exact procedure the math was built to run.

The deeper thing the grumbling physicists handed everyone in October 2024 is that the line between “your field” and “their field” is mostly an accident of which textbook you happened to open first. The equation does not know which department it's filed under. Boltzmann wrote it down to explain why a gas has a temperature; you compute it to choose the next word; it is, down to the last symbol, the same equation. And behind it stands a hundred and fifty years of brilliant people who thought very hard about what it means, a toolbox you already own and have mostly been declining to open. The physicists were right that it's just statistical mechanics. That was never an insult. It was directions.

Sources: the 2024 Nobel Prize in Physics (announced October 8, 2024) to John J. Hopfield and Geoffrey E. Hinton, “for foundational discoveries and inventions that enable machine learning with artificial neural networks” (NobelPrize.org), and the disciplinary controversy that followed, the argument that the work has “only a lateral relationship to physics” and the fair critique that earlier pioneers (e.g., Ivakhnenko, Amari) went unrecognized (IEEE Spectrum and related commentary). Hinton's Boltzmann machine (Hinton & Sejnowski, ~1985), which uses P(state) ∝ exp(−E/T) and is named for Boltzmann, and Hopfield's energy-based network (1982) as the stat-mech foundations cited. The Boltzmann distribution and Gibbs's partition function (Ludwig Boltzmann, 1870s; J. W. Gibbs, Elementary Principles in Statistical Mechanics, 1902); free energy F = −T ln Z; E. T. Jaynes, “Information Theory and Statistical Mechanics” (1957), the maximum-entropy derivation. Simulated annealing: Kirkpatrick, Gelatt & Vecchi, “Optimization by Simulated Annealing,” Science (1983); Geman & Geman (1984) on logarithmic cooling and convergence. Measured phase-transition behavior in a real model: “Critical Phase Transition in a Large Language Model” (arXiv 2406.05335, 2024), a GPT-2 temperature sweep reporting transition-like behavioral shifts, with low-temperature mode collapse as ground-state freezing. The claim is an identity of mathematical form (ML temperature is dimensionless, k_B absorbed; logits are not physical energies), and softmax also has convergent non-thermodynamic derivations (logistic, Luce choice); a finite model exhibits a sharp crossover rather than a strictly non-analytic transition, which the cited paper measures as such; Jaynes's interpretation is presented as illuminating, not settled.

You already compute the confidence signal. Then you throw it away.

The log-partition is a first-principles measure of how certain the model was at each step, produced on every forward pass and discarded the instant you normalize. For an autonomous agent, that per-decision confidence is exactly the kind of signal that should survive into the record: not just what the agent did, but how sure it was when it did it, and whether it was sampling near one of those phase boundaries when it went off the rails. Chain of Consciousness anchors an agent's actions to a tamper-evident record, so the confidence behind a decision is something you can audit after the fact instead of a scalar that vanished at inference time.


pip install chain-of-consciousness · npm install chain-of-consciousness
