The Grammar of Music

The tempered lie is what makes the grammar speakable.

Published April 2026

In 1722, Johann Sebastian Bach completed Das wohltemperierte Clavier — The Well-Tempered Clavier. Twenty-four preludes and fugues, one in each of the twelve major and twelve minor keys. It was the first comprehensive demonstration that a single keyboard could play in every key without retuning.

The title is a statement of method. Well-tempered. Tempered, as in tuned out of true. The book only worked because Bach's keyboard had been deliberately de-tuned. In the equal temperament that later became standard, each fifth is flattened by about two cents from its physically correct 3:2 ratio; the well temperaments of Bach's own day spread the same total correction unevenly across the fifths. Two cents is about one part in a thousand, inaudible as pitch. What the small lie bought was free movement through all twenty-four keys.

It had to be a lie. A true perfect fifth — the third harmonic of a vibrating string, the first non-octave overtone, the most consonant interval in the universe — does not close a twelve-step loop. Stack twelve true fifths and you overshoot seven octaves by what sixteenth-century theorists called the Pythagorean comma: about 23.46 cents, a quarter of a semitone, audible enough to ruin a chord (Barbour, Tuning and Temperament, 1951). The circle of fifths does not exist in physics. It exists because keyboard builders, in the century before Bach, distributed the comma evenly across twelve intervals and made modulation possible.
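
The arithmetic is short enough to check by hand or by machine. The sketch below assumes nothing beyond the 3:2 ratio and the standard definition of a cent (1200 per octave); the names cents, comma, and flattening_per_fifth are illustrative.

```python
from fractions import Fraction
from math import log2

def cents(ratio):
    """Size of a frequency ratio in cents (1200 cents per octave)."""
    return 1200 * log2(ratio)

PURE_FIFTH = Fraction(3, 2)

# Twelve pure fifths against seven octaves: the leftover ratio is the comma.
comma = PURE_FIFTH ** 12 / Fraction(2) ** 7
comma_cents = cents(comma)

# Equal temperament defines the fifth as exactly 7/12 of an octave (700 cents).
tempered_fifth_cents = 1200 * 7 / 12
flattening_per_fifth = cents(PURE_FIFTH) - tempered_fifth_cents

print(comma)                           # 531441/524288
print(round(comma_cents, 2))           # 23.46
print(round(flattening_per_fifth, 3))  # 1.955
```

Twelve flattenings of roughly 1.955 cents add up to the whole comma, which is exactly what it means to spread it evenly.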

This essay is about what happens when a piece of mathematics — a small act of group theory performed at a keyboard — becomes the foundation for grammar.

Two questions, one tradition

When we ask how music "works," we are usually asking at least two questions without distinguishing them.

The first is about vocabulary. Which notes exist. Which are close to which. Which substitutions are allowed. This is the layer governed by the circle of fifths — in algebraic terms, the cyclic group ℤ/12ℤ, the integers modulo twelve under addition.

The second is about syntax. Which sequences of chords are coherent. How a recurring theme relates to its return eight minutes later. How a development section resolves to a recapitulation. This is the layer governed by something that looks a great deal like a formal grammar — the same kind of object Noam Chomsky proposed in 1956 to describe natural language.

The load-bearing argument of this essay is that the two layers are not independent. In language, syntax and lexicon are largely orthogonal: you can replace every noun in English with a nonsense word and still produce grammatical sentences. In music, you cannot. Tonal syntax depends on the algebra underneath it in a way English syntax does not depend on its phoneme set. The circle of fifths is not decoration; it is a load-bearing element of the grammar itself.

The algebra under the notes

The mathematician John D. Cook put the circle of fifths cleanly in a 2009 blog post: a fifth is seven chromatic steps, and since 7 is relatively prime to 12, it follows that 7 generates ℤ/12ℤ. Start on any pitch class, repeatedly add seven semitones, and you will visit every pitch class — all twelve — before returning to your starting point.

This works because gcd(7, 12) = 1. The fifth doesn't generate the chromatic scale because of anything special about fifths. It generates it because of something special about 7 and 12. Cook's observation, which he credits to elementary number theory: stepping by m through n notes covers all of them if and only if m and n are coprime.

Which integers less than 12 are coprime with 12? Only four: 1, 5, 7, and 11. Four intervals. Four circles. The circle of semitones (step 1). The circle of fourths (step 5). The circle of fifths (step 7). The circle of major sevenths (step 11). Everything else fails.
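
A few lines of Python make the coprimality condition concrete. This is a minimal sketch, assuming pitch classes are numbered 0 to 11 with C as 0; the function name cycle is illustrative.

```python
from math import gcd

N = 12  # pitch classes in the chromatic scale

def cycle(step, start=0, n=N):
    """Pitch classes visited by repeatedly adding `step` mod n until the start recurs."""
    visited = []
    pc = start
    while True:
        visited.append(pc)
        pc = (pc + step) % n
        if pc == start:
            return visited

# A step size generates all twelve pitch classes exactly when it is coprime with 12.
generators = [m for m in range(1, N) if gcd(m, N) == 1]

print(generators)  # [1, 5, 7, 11]
print(cycle(7))    # [0, 7, 2, 9, 4, 11, 6, 1, 8, 3, 10, 5]
```

Read with C as 0, the second line of output is the circle of fifths: C, G, D, A, E, B, F♯, and so on back around.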

There is no circle of thirds. Ascending by major thirds (four semitones) produces a three-cycle: the augmented triad, C–E–G♯–C. Ascending by minor thirds (three semitones) produces a four-cycle: the diminished seventh, C–E♭–F♯–A–C. These are subgroups of ℤ/12ℤ, corresponding to the divisors of 12. John Coltrane built Giant Steps (1960) around the three-cycle of major thirds, the B/G/E♭ key axis, and the reason the piece sounds disorienting even to trained ears is that it bypasses the fifth-based tonal lattice the rest of Western music is built on. The piece lives, in a precise algebraic sense, inside a coset of that subgroup: the same three-cycle, shifted to start on B.
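
The short cycles can be listed the same way. The sketch below assumes C is pitch class 0 and uses one conventional spelling for each pitch class; orbit is an illustrative name, and its size formula is the standard one for a cyclic group, 12 divided by gcd(step, 12).

```python
from math import gcd

N = 12
NAMES = ["C", "C♯", "D", "E♭", "E", "F", "F♯", "G", "G♯", "A", "B♭", "B"]

def orbit(step, start=0):
    """Pitch classes reachable from `start` by repeated steps of `step` semitones."""
    size = N // gcd(step, N)  # the cycle closes after this many steps
    return [(start + k * step) % N for k in range(size)]

major_thirds = orbit(4)                          # [0, 4, 8]
minor_thirds = orbit(3)                          # [0, 3, 6, 9]
giant_steps = orbit(4, start=NAMES.index("B"))   # [11, 3, 7]

print([NAMES[pc] for pc in major_thirds])  # ['C', 'E', 'G♯']
print([NAMES[pc] for pc in minor_thirds])  # ['C', 'E♭', 'F♯', 'A']
print([NAMES[pc] for pc in giant_steps])   # ['B', 'E♭', 'G']
```

The last line is the Giant Steps axis: the major-third cycle started on B rather than C, so it never touches the other nine pitch classes as key centers.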

The fifth won out over the fourth and the semitone and the major seventh as the canonical generator for a non-mathematical reason. The 3:2 frequency ratio is the most consonant interval other than unison or octave; two pitches a true fifth apart share more overtones than any other pair. So the fifth was the interval that happened to be both a group generator of ℤ/12ℤ and the most physically stable sound short of silence. A fortunate coincidence.

The ladder above the notes

In 1956, Noam Chomsky published "Three Models for the Description of Language," arguing that grammars describing natural language sit in a hierarchy of increasing expressive power. In the form he fixed three years later, the hierarchy has four levels, from simplest to most general.

Type 3, regular languages. Generated by finite-state machines. Fixed patterns, basic repetition, Markov-like sequences. Cannot count.

Type 2, context-free. Pushdown automata. Balanced nested structures: arithmetic expressions, parentheses, aⁿbⁿ.

Type 1, context-sensitive. Linear-bounded automata. Cross-serial patterns — three or more sequences matched in count across a whole string.

Type 0, unrestricted. Anything computable. A full Turing machine.

Each class properly contains the one below. The formal class of your grammar constrains what structures you can express — a regular grammar cannot generate aⁿbⁿ, and no amount of patching will fix that.
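
The gap between Type 3 and Type 2 can be felt in a few lines. This is a rough sketch, not a proof: the pattern a*b* is the closest a regular expression gets to aⁿbⁿ, and the hypothetical is_anbn function adds the one thing a finite-state machine lacks, an unbounded counter standing in for a pushdown stack.

```python
import re

# Regular: can check that every a precedes every b, but cannot compare the counts.
REGULAR_APPROX = re.compile(r"a*b*")

def is_anbn(s):
    """Recognize a^n b^n (n >= 0) with one counter standing in for a pushdown stack."""
    depth = 0
    seen_b = False
    for ch in s:
        if ch == "a":
            if seen_b:          # an a after a b breaks the pattern
                return False
            depth += 1
        elif ch == "b":
            seen_b = True
            depth -= 1
            if depth < 0:       # more b's than a's so far
                return False
        else:
            return False
    return depth == 0

print(bool(REGULAR_APPROX.fullmatch("aab")))  # True
print(is_anbn("aab"))                         # False
print(is_anbn("aabb"))                        # True
```

The regular pattern accepts the unbalanced string because it has nowhere to keep the count.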

The empirical question for linguistics was: where does natural language actually sit? Most of English looked context-free. But in 1984 Riny Huybregts, studying Dutch, and in 1985 Stuart Shieber, studying Swiss German, produced the same argument from different constructions. Certain embedded clauses require cross-serial dependencies, a pattern in the same family as aⁿbⁿcⁿ that is provably outside context-free. Shieber's canonical example from Swiss German: Jan säit das mer em Hans es huus hälfed aastriiche, "Jan says that we helped Hans paint the house," in which the case-marked noun phrases (dative em Hans, accusative es huus) must be matched in order with the verbs that govern them (hälfed, aastriiche), and the pattern extends to longer chains of objects and verbs (Shieber, Linguistics and Philosophy 8(3), 1985).
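
What "cross-serial" means is easiest to see next to its nested alternative. The sketch below is a toy, not a parser: the two case tables are a simplified assumption covering only the words of Shieber's sentence, and the function names are illustrative. Cross-serial order pairs the first noun phrase with the first verb; nested order, the kind a stack produces, pairs the first noun phrase with the last verb.

```python
# Simplified, assumed case tables for the words in Shieber's example sentence.
NP_CASE = {"em Hans": "dative", "es huus": "accusative"}
VERB_GOVERNS = {"hälfed": "dative", "aastriiche": "accusative"}

def cross_serial_match(noun_phrases, verbs):
    """True if the i-th noun phrase carries the case the i-th verb governs."""
    if len(noun_phrases) != len(verbs):
        return False
    return all(
        NP_CASE[phrase] == VERB_GOVERNS[verb]
        for phrase, verb in zip(noun_phrases, verbs)
    )

def nested_match(noun_phrases, verbs):
    """True if noun phrases pair with verbs from the outside in, as a stack would pair them."""
    return cross_serial_match(noun_phrases, list(reversed(verbs)))

phrases = ["em Hans", "es huus"]
verbs = ["hälfed", "aastriiche"]

print(cross_serial_match(phrases, verbs))  # True
print(nested_match(phrases, verbs))        # False
```

The sentence is well-formed only under the cross-serial pairing, which is the pairing a pushdown automaton cannot enforce in general.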

So natural language sits just above context-free, in the band Aravind Joshi named "mildly context-sensitive" in 1985: strictly more expressive than Type 2, strictly less than full Type 1, and parsable in polynomial time. That specific slot is the empirical home of human language grammar.

Where music sits

Music sits there too. Probably.

The placement of tonal music on the Chomsky hierarchy is not a theorem. It is an empirical claim about how harmony behaves, argued over for forty years. The first formal attempt was Fred Lerdahl and Ray Jackendoff's 1983 book A Generative Theory of Tonal Music, a development of Leonard Bernstein's 1973 Norton Lectures at Harvard, in which Bernstein — aware of Chomsky's program — argued that music must have something like a generative grammar of its own. Lerdahl and Jackendoff credit Bernstein explicitly. Their book slightly cheats, though: it presents well-formedness rules and preference rules rather than a proper generative grammar with derivation trees. It set up the problem without quite finishing it.

The finishing came later. Mark Steedman, in a 1984 Music Perception paper, wrote a context-free grammar for jazz chord progressions including the 12-bar blues. Martin Rohrmeier, in 2011, published "Towards a generative syntax of tonal harmony" in the Journal of Mathematics and Music, producing a phrase-structure grammar for diatonic tonal music that could model recursive prolongation — the phenomenon where a tonic chord is elaborated by its own dominant, which is elaborated by its own secondary dominant, nested arbitrarily deep. Rohrmeier argues explicitly against Markov models: the structure of harmonic progressions exceeds the simplicity of Markovian transition tables.
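
Recursive prolongation can be sketched with a single toy rule. This is not Rohrmeier's grammar, only an illustration under stated assumptions: chords are reduced to their roots, C is pitch class 0, and the hypothetical rule X → V/X X says any chord may be preceded by its own dominant, a fifth (seven semitones) above it.

```python
NAMES = ["C", "C♯", "D", "E♭", "E", "F", "F♯", "G", "G♯", "A", "B♭", "B"]

def prolong(root, depth):
    """Apply the toy rule X -> V/X X `depth` times; return chord roots left to right."""
    if depth == 0:
        return [root]
    dominant = (root + 7) % 12  # the dominant lies a fifth above the chord it prepares
    # The dominant is itself a chord, so it can be prolonged by its own dominant.
    return prolong(dominant, depth - 1) + [root]

progression = prolong(0, 3)
print([NAMES[pc] for pc in progression])  # ['A', 'D', 'G', 'C']
```

Three applications of the rule to a C tonic yield A, D, G, C: a chain of secondary dominants falling by fifths. The production rule itself contains the arithmetic of the circle, since the "+ 7 mod 12" is part of the grammar.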

The consensus, consolidated in Rohrmeier and Marcus Pearce's 2018 chapter in the Springer Handbook of Systematic Musicology, is that tonal music's generative structure is at or near mildly context-sensitive — the same formal class as natural language. Jonah Katz and David Pesetsky went further in a 2011 manuscript, "The Identity Thesis for Language and Music", arguing that all formal differences between the two reduce to differences in their atomic units. The strong claim is controversial. The weaker claim — shared formal class — is close to settled.

If that is right, then the answer to "how does music work?" requires two things on top of each other. A cyclic group of twelve pitch classes, with the circle of fifths defining adjacency. And a mildly-context-sensitive grammar, with production rules that move around the circle in specific, constrained ways.

The entanglement

Here is the asymmetry that makes music structurally strange.

In language, syntax and lexicon are largely independent. English grammar can be applied to any vocabulary. The blicket gorped the dax parses as subject-verb-object despite none of those words existing. Grammar operates on syntactic categories, not on the specific phonemes the categories contain. Swap every noun in English for a randomly generated word and you still have English syntactically. This is why "Jabberwocky" scans. This is why a child can learn her language's grammar before she knows most of its vocabulary.

In tonal music, this decoupling fails. Remove any pitch class from the twelve and the grammar collapses.

Specifically: the rules of tonal harmony use distance around the circle of fifths as their fundamental geometric prior. A modulation from C major to G major is grammatically cheap because the two keys are adjacent on the circle (one step). A modulation from C major to F♯ major is grammatically expensive because the keys are maximally distant (six steps), and tonal grammar has specific machinery for negotiating that distance: pivot chords, enharmonic reinterpretation, common-tone modulation. These rules do not exist in the abstract. They exist as operations on ℤ/12ℤ. Reduce the scale to eleven tones and the algebra breaks: 11 is prime, so its cyclic group has no proper non-trivial subgroups and every nonzero step generates the whole group. No interval is singled out as the generator, the augmented and diminished cycles disappear, and the adjacency geometry the grammar depends on simply does not hold.
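
The distance in question is easy to compute. A minimal sketch, assuming tonics are numbered as pitch classes with C as 0; the function names are illustrative. It relies on the fact that 7 × 7 = 49 ≡ 1 (mod 12), so multiplying a pitch class by 7 gives its position on the circle of fifths.

```python
N = 12

def fifths_position(pc):
    """Position of a pitch class on the circle of fifths, with C at position 0."""
    # 7 is its own inverse mod 12, so multiplying by 7 undoes stepping by fifths.
    return (pc * 7) % N

def key_distance(tonic_a, tonic_b):
    """Fewest steps around the circle of fifths between two tonics."""
    steps = (fifths_position(tonic_b) - fifths_position(tonic_a)) % N
    return min(steps, N - steps)

C, F, F_SHARP, G = 0, 5, 6, 7

print(key_distance(C, G))        # 1
print(key_distance(C, F))        # 1
print(key_distance(C, F_SHARP))  # 6
```

G and F sit one step either side of C; F♯ sits at the far pole, six steps away in both directions.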

This is why tonal music cannot be written with seven of the twelve pitch classes. A diatonic seven-note scale can express one key. It cannot modulate, because modulation is movement through the circle's adjacency, and the circle depends on all twelve. Language's grammar is largely orthogonal to its phoneme inventory. Music's grammar is entangled with its pitch-class group.
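
One way to see how a key depends on the circle: a major scale is seven consecutive positions on it, starting a fifth below the tonic. The sketch below assumes C is pitch class 0; major_scale and shared are illustrative names.

```python
def major_scale(tonic):
    """Pitch classes of a major scale: seven consecutive fifths, starting a fifth below the tonic."""
    return {(tonic + 7 * k) % 12 for k in range(-1, 6)}

def shared(tonic_a, tonic_b):
    """Number of pitch classes two major keys have in common."""
    return len(major_scale(tonic_a) & major_scale(tonic_b))

C, F_SHARP, G = 0, 6, 7

print(sorted(major_scale(C)))  # [0, 2, 4, 5, 7, 9, 11]
print(shared(C, G))            # 6
print(shared(C, F_SHARP))      # 2
```

Moving one step around the circle swaps a single note, which is why the neighboring key is cheap to reach; moving six steps leaves only two notes in common.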

One sentence to carry the essay: the circle of fifths is not below the grammar; it is part of the grammar, and changing it changes what the grammar can say.

What this explains

The entanglement explains several things that otherwise look like stylistic preferences.

It explains why atonality feels like a different art form rather than different music. Arnold Schoenberg's twelve-tone technique, formalized around 1923, does not merely shift compositional taste — it abandons the ℤ/12ℤ adjacency geometry by treating all twelve pitch classes as equivalent. The grammar does not have a broken prior; it has no prior. The result is a compositional system with different formal properties, not a different dialect.

It explains why non-Western musical traditions sound "different" in a way that is not merely ornamental. Indian classical music uses twenty-two shrutis rather than twelve equal pitches; Arabic maqam uses quarter-tones; Indonesian gamelan uses non-octave-periodic scales. These are not the same algebra with different surface decoration; they are different pitch-class groups that support different harmonic grammars. Every tradition has a pitch-class algebra paired with a hierarchical grammar. The specific ℤ/12ℤ does not generalize. The argument of this essay is scoped to twelve-tone equal temperament, the system The Well-Tempered Clavier has come to stand for.

It explains something about modern AI as well. Google's Music Transformer (Huang et al., ICLR 2019, arXiv:1809.04281) showed that attention-based models trained on MIDI implicitly learn relative-pitch structure that mirrors the circle's geometry. A 2024 arXiv paper (Moyo & Chiurunge, arXiv:2403.00790) goes the other way, proposing the musical circle of fifths as an explicit geometric scaffold for structuring neural activation spaces. Either way, the geometry shows up. You can give the model the prior or let it rediscover the prior, but you cannot skip it.

Where the analogy breaks

An honest mapping has to mark its edges. Three divergences, ordered from most to least serious.

First, the Chomsky-hierarchy placement is a working hypothesis, not a theorem. Rohrmeier's grammar covers diatonic tonal progressions; it does not cleanly cover post-tonal composition, extended jazz substitutions, or micro-tonal repertoire. "At or near mildly context-sensitive" is the current best fit, not the last word.

Second, music and language are processed by partially different neural systems. Aniruddh Patel's Music, Language, and the Brain (Oxford, 2008) documents real dissociations — amusia patients who retain full linguistic ability, aphasic patients who retain full musical ability. Shared formal class does not mean shared substrate. The rhyme is formal, not anatomical.

Third, music's grammar is more permissive. Ungrammatical English sentences feel broken in a way that unusual chord progressions rarely do. Music's semantic demands are looser — a progression does not need to refer to anything — which gives its grammar more headroom before a violation becomes a violation.

The first two are honest limits. The third sharpens the main claim: music has a grammar whose rules are couched in its algebra, and because it does not need to refer, it can play near the edge of its own rules more openly than language can.

Close

Bach's 1722 decision — to write in all twenty-four keys, to demonstrate that the tempered keyboard could carry any modulation — was not ultimately a musical decision. It was an algebraic one. He was showing that ℤ/12ℤ could be treated as a closed system, that the circle could in fact close, that the grammar could move freely around it. Three centuries later, Western music from Beethoven through jazz through pop is a mildly-context-sensitive grammar operating on a cyclic group of twelve pitch classes. The two layers — the algebra below, the hierarchy above — are the same object viewed from two angles.

For anyone building systems with rules and vocabularies and long-range dependencies — language models, parsers, chord engines, protocol designers — the practical lesson is worth holding onto. It is not enough to ask what rules your grammar has. You also have to ask what algebraic structure your tokens live in, and whether your rules depend on that structure in ways you haven't named. If they do, you cannot change the token set without quietly changing the grammar. Most systems inherit their algebras by accident — a character set, an identifier space, a feature dictionary — and build rules on top that covertly exploit the inheritance. When something later forces you to change the underlying set, the grammar above fails in ways that look like bugs but are really the algebra speaking.

In music the problem was solved three hundred years ago by a deliberate act of mistuning. Every fifth on the keyboard was bent two cents flat so the circle would close. The tempered lie is what makes the grammar speakable. Everything since has rested on it.

Sources: Bach, Das wohltemperierte Clavier Book I, 1722; Barbour, Tuning and Temperament: A Historical Survey, Michigan State College Press, 1951; Chomsky, "Three Models for the Description of Language," IRE Transactions on Information Theory, 1956; Cook, "The circle of fifths and number theory" (johndcook.com, 2009); Huybregts, "The Weak Inadequacy of Context-Free Phrase Structure Grammars," 1984; Shieber, "Evidence against the context-freeness of natural language," Linguistics and Philosophy 8(3), 1985 (doi:10.1007/BF00630917); Joshi, "Tree Adjoining Grammars," in Dowty, Karttunen & Zwicky (eds.), Natural Language Parsing, Cambridge UP, 1985; Lerdahl & Jackendoff, A Generative Theory of Tonal Music, MIT Press, 1983; Steedman, "A Generative Grammar for Jazz Chord Sequences," Music Perception 2(1), 1984 (doi:10.2307/40285282); Rohrmeier, "Towards a generative syntax of tonal harmony," Journal of Mathematics and Music 5(1), 2011 (doi:10.1080/17459737.2011.573676); Rohrmeier & Pearce, "Musical Syntax I: Theoretical Perspectives," in Springer Handbook of Systematic Musicology, 2018 (doi:10.1007/978-3-662-55004-5_25); Katz & Pesetsky, "The Identity Thesis for Language and Music" (manuscript, lingbuzz/000959, 2011); Patel, Music, Language, and the Brain, Oxford UP, 2008; Huang et al., "Music Transformer," ICLR 2019 (arXiv:1809.04281); Moyo & Chiurunge, "Structuring Concept Space with the Musical Circle of Fifths," 2024 (arXiv:2403.00790).
