The fifth rewrite is not going to fix the five cases. Brooks said why, forty years before you started.
You are on the fifth rewrite of the system prompt. You have added bullet points, moved the “always” to the top and the “never” to the bottom, tried structured XML, a JSON version, and a version that begins with “You are a senior engineer with twenty years of experience.” The model is still doing the wrong thing in the same five cases it has failed on since you started. You suspect, in a way you have not yet said out loud, that no rewrite will fix it.
You are correct. The reason was published in 1986 by Fred Brooks, of Mythical Man-Month fame, in a paper that has accumulated something like ten thousand citations and is a staple of graduate software engineering courses. The paper is called “No Silver Bullet: Essence and Accident in Software Engineering.” The relevant sentence is the one most often quoted out of context: “The hard thing about building software is deciding what one wants to say, not saying it.”
Replace “building software” with “prompting an LLM” and “saying it” with “prompting it.” You get, forty years before anyone had heard of the technology, a precise diagnosis of why prompt engineering plateaus.
This essay is about that diagnosis: what Brooks claimed in 1986, what he was right about, what he was wrong about, and what it tells you about the system prompt you are still rewriting.
Brooks split software complexity into two categories. Essential complexity is the difficulty inherent in the problem itself — specifying what you want, designing how the pieces fit, deciding what counts as correct behavior in edge cases the original spec didn't anticipate. Accidental complexity is the difficulty introduced by your tools and representations — writing in assembly versus a high-level language, debugging without version control, compiling slowly, formatting your code by hand.
Tools, Brooks argued, can only reduce accidental complexity. By 1986 they had reduced most of it: structured programming, high-level languages, time-sharing, and unified development environments had taken the multi-day edit-compile cycle down to seconds. What remained was essential complexity. And no further tool — no better language, no smarter compiler, no new methodology — could deliver another order-of-magnitude productivity gain, because the bottleneck was no longer in the tools.
He identified four properties that made software essentially complex. Complexity: software has no repeating parts at the conceptual level — every function and interface is unique, unlike buildings made of identical bricks or circuits made of identical transistors. Conformity: software must interface with an arbitrary world of legacy systems, external APIs, regulations, and other people's decisions. Changeability: software embodies its function, and the function is exactly the part that is always under pressure to change. Invisibility: software has no natural spatial representation — every diagram is a partial view from one angle, and no single view captures the whole.
These four properties are why the silver bullet doesn't come. Each technological hope of the 1980s — Ada, object-oriented programming, AI, expert systems — addressed some accidental complexity and ran into the same wall against essential complexity. The 10x productivity gain Brooks said was impossible has not arrived in the forty years since, although individual accidental gains have stacked up impressively.
The interesting move is to take four of the hopes Brooks examined and port them to 2024-2026.
Bigger models are the new Ada. They are dramatically better at executing whatever prompt you put in front of them. GPT-5, Claude 4, and Gemini Ultra outperform their predecessors on every benchmark that captures execution quality. They do not, however, reduce the essential complexity of deciding what to put in the prompt. Ian Cooper made this exact port in June 2024: “LLMs can add value as an expert assistant but this cannot be an order of magnitude improvement due to the fact that it cannot impact the essential complexity, only the accidental complexity where no order of magnitude improvement remains to be obtained.” The pattern Brooks identified for languages is the pattern Cooper observes for models.
Better prompts are the new OOP. Chain-of-thought, few-shot examples, structured prompting with XML or markdown headers — these address the accidental complexity of communicating with the model. They are real improvements. Few-shot prompting in particular can take a model from 0% to 90% accuracy on tasks where the examples capture the essential specification (Schulhoff, Lenny's Newsletter, 2025). But the improvements behave like OOP improvements: they reduce the duplication, they sharpen the interface, they do not reduce the essential difficulty of figuring out what the model is supposed to do in cases the examples don't cover. Schulhoff's own finding that role prompting (“you are a math professor”) is “largely ineffective” on correctness is the cleanest illustration: changing the surface (tone, vocabulary, register) does not change the substance.
Multi-agent frameworks are the new AI-as-silver-bullet. CrewAI, AutoGen, LangGraph, the A2A protocol, and the rest of the orchestration layer each distribute the essential complexity across multiple specialized agents. None of them reduces it. “Who decides what correct means when two agents disagree?” is the same question as “Who decides what correct means in a single agent?”; it has only been moved one level up the stack. The coordination complexity is additive: more agents means more specification surface, not less.
Autonomous tooling is the new expert systems. Self-improving agents, automatic prompt optimization, DSPy-style pipelines that search the prompt space — these were going to be the systems that captured human expertise and ran without humans in the loop. They are useful, sometimes dramatically, on well-structured tasks where “correct” is unambiguous. They fail on ill-defined problems — exactly as Brooks predicted forty years ago for expert systems. Prompt injection, in this lens, is not a temporary security gap to be patched. It is the same essential-complexity problem that beat expert systems: the tool cannot verify that the input matches the intent, because intent is in the head of the person who specified the task, not in the input.
In each case, the modern hope does what Brooks's 1986 hopes did: reduces real accidental complexity, cannot touch essential complexity.
Brooks's specific claim was that no single technique would yield a 10x improvement. It is worth running the test against the actual prompt-engineering toolbox.
Chain-of-thought prompting: roughly 2-3x on reasoning tasks (Wei et al., 2022). Few-shot prompting: up to 90x on tasks where the baseline is near zero — but this is a category unlock, not a productivity multiplier. The model went from “cannot do it at all” to “can do it”; the same shift happened when high-level languages let programmers write things assembly couldn't reach. Once you have few-shot examples, no further prompting trick gives you a second 90x. Structured prompting with XML and headers: roughly 1.2-1.5x by industry consensus, not formally measured. Retrieval-augmented generation: 2-4x on knowledge tasks (Lewis et al., 2020). Automatic prompt optimization with DSPy or OPRO: 1.5-3x, varying by task.
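The test in the paragraph above is simple enough to run mechanically. The sketch below is illustrative only: it takes the rough ranges quoted in the text (the essay's figures, not new measurements), compares the upper end of each against Brooks's order-of-magnitude bar, and keeps few-shot separate because the text treats it as a category unlock and not a productivity multiplier. The names `TECHNIQUES`, `FEW_SHOT_UNLOCK`, and `SILVER_BULLET` are hypothetical.

```python
# Rough (low, high) multipliers as quoted in the essay; not measurements.
TECHNIQUES = {
    "chain-of-thought": (2.0, 3.0),
    "structured prompting": (1.2, 1.5),
    "retrieval-augmented generation": (2.0, 4.0),
    "automatic prompt optimization": (1.5, 3.0),
}

# Few-shot is held apart: the essay calls its "up to 90x" a category unlock.
FEW_SHOT_UNLOCK = 90.0

# Brooks's bar: a single technique yielding an order-of-magnitude gain.
SILVER_BULLET = 10.0


def is_silver_bullet(multiplier: float, threshold: float = SILVER_BULLET) -> bool:
    """Return True if a multiplier reaches the order-of-magnitude bar."""
    return multiplier >= threshold


def report(techniques: dict) -> list:
    """Return (name, low, high, clears_bar) for each technique."""
    return [
        (name, low, high, is_silver_bullet(high))
        for name, (low, high) in techniques.items()
    ]


if __name__ == "__main__":
    for name, low, high, clears in report(TECHNIQUES):
        verdict = "clears 10x" if clears else "below 10x"
        print(f"{name}: {low}-{high}x ({verdict})")
```

Even on the most generous end of each quoted range, none of the repeatable techniques reaches the bar; the only figure that does is the one-time few-shot unlock.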
The progression follows the diminishing-returns curve Brooks described for software tools. Basic prompting in 2022 was the move from assembly to high-level languages — a massive accidental-complexity reduction. Chain-of-thought was the debugger. Few-shot was the code library. Structured prompting was the IDE feature. Context engineering is the build system. Automatic optimization is the profiler. Each subsequent tool addresses a smaller fraction of the remaining accidental complexity. The essential complexity is unchanged at every step. The wall is where Brooks said the wall would be.
The most honest moment in Schulhoff's interview came when he was asked which prompting techniques actually work in 2025. He listed two: chain-of-thought and few-shot. Everything else is a variation on those two themes, with marginal additional gains. Forty years of tooling experience predicted exactly this trajectory in advance.
In June 2025, Andrej Karpathy made what should be read as a Brooks-style intervention. He argued that the field should stop saying “prompt engineering” and start saying “context engineering”: the work of filling the context window with the right information for the next step. His metaphor: the LLM is the CPU, the context window is RAM, the engineer is the operating system. Don't optimize the wording of the instruction; optimize what information the model has access to when it processes the instruction.
This is a Brooks-move. Prompt engineering focuses attention on the surface of the instruction (accidental). Context engineering focuses attention on the substance of what the model needs to solve the problem (closer to essential). The rename is not cosmetic. It is an explicit redirect from optimizing the tool to understanding the problem, which is exactly what Brooks prescribed in 1986.
But Karpathy is not the silver bullet either, and reading his framing carefully shows why. The operating system, in Brooks's framework, is precisely the layer that manages accidental complexity — memory allocation, process scheduling, I/O buffering. The OS does not decide what programs to run or what they should do. That is the user's essential responsibility. Context engineering, by direct analogy, manages the accidental complexity of the context window — what to load, when to evict, how to format, when to summarize. It does not determine what the agent is supposed to do, which is the system prompt designer's essential responsibility. Karpathy is right that engineers should think like OS designers. Brooks is right that no OS design eliminates the essential complexity of the applications running on top of it.
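The division of labor described above can be made concrete with a minimal sketch. Everything here is an assumption for illustration: the `Snippet` type, the precomputed `relevance` and `tokens` fields, and the greedy loading policy are hypothetical and do not come from Karpathy or any particular framework. The point is only that the function manages what to load and what to leave out, while the task string passes through untouched.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snippet:
    text: str
    relevance: float  # hypothetical score; higher means more relevant
    tokens: int       # hypothetical precomputed token count


def assemble_context(task: str, snippets: list[Snippet], budget: int) -> str:
    """Greedily load the most relevant snippets that fit in the token budget."""
    chosen = []
    used = 0
    for snippet in sorted(snippets, key=lambda s: s.relevance, reverse=True):
        if used + snippet.tokens <= budget:
            chosen.append(snippet)
            used += snippet.tokens
    context = "\n".join(s.text for s in chosen)
    # The task is appended verbatim: deciding what it should say is the
    # designer's job, and nothing in this function can do it for them.
    return f"{context}\n\nTask: {task}"
```

A better ranking or eviction policy would improve what the model sees, but no version of `assemble_context` can repair a `task` that does not say what correct behavior is.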
The Karpathy reframing is a real advance — the field is better off measuring context relevance than measuring prompt phrasing — but it is not, and Karpathy did not claim it was, a 10x multiplier on the essential specification problem. It moves the boundary. It does not dissolve it.
Brooks made one specific mistake worth naming. In 1986 he listed AI among the hopes he examined and predicted it would not be a silver bullet for software engineering. What he underestimated was AI's impact on accidental complexity. GitHub Copilot, Claude Code, and the broader generation of AI coding assistants have demonstrably reduced the time programmers spend on boilerplate, syntax debugging, and test generation. The accidental-complexity reduction is real, large, and measurable.
But he was right about the deeper claim. Ask any developer using a frontier AI coding assistant whether their software is 10x better than the software they shipped before the tool existed. The answer is no. It is produced faster, in some workflows much faster, but the bugs that ship are the same kinds of bugs, the design decisions are still the hard part, the architecture reviews still take days, the requirements still come back changed. Brooks underestimated how much accidental complexity AI would remove; his prediction that no tool could touch essential complexity came out right; and AI itself turned out to be the cleanest confirmation of the broader framework.
The same pattern is going to play out for prompt engineering. Better prompting techniques will continue to reduce accidental complexity. Models will get better at parsing intent from imperfect prompts. Context engineering tools will assemble relevant information more reliably than human prompt designers can. Multi-agent orchestration will handle distribution and parallelism that single prompts cannot. All of these are real, and most of them are already underway. None of them will turn out to be the silver bullet, because the essential complexity of deciding what the agent should do is exactly the part no tool can touch.
The framework has now survived four waves of technology change since 1986. OOP in the 1990s. Agile in the 2000s. DevOps and cloud in the 2010s. AI in the 2020s. Each wave brought real accidental improvements. None broke the 10x ceiling. That track record is itself evidence: Brooks's framework was not a prediction about Ada or expert systems; it was a structural claim about what tools can and cannot do, and the structural claim has held across forty years of unrelated technologies.
Three practical moves follow.
The first is to diagnose before rewriting. Before you change another word of the system prompt, ask whether the failure mode you are trying to fix is essential or accidental. “The model doesn't follow my formatting” is accidental: fix it with better structuring or a different parser. “The model picks the wrong action when the input is ambiguous” is essential: no prompt change will fix it, because you have not specified what “the right action when the input is ambiguous” means in your domain. Diagnosing first will save you the next four rewrites.
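One way to make the diagnosis routine is to force the question "can anyone write down the right behavior for this input?" before touching the prompt. The sketch below is a hypothetical triage helper, not an established method: the `Failure` type and the rule that a missing `expected_behavior` means the failure is essential are assumptions that compress the distinction drawn above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Failure:
    description: str
    # What the model should have done, written down before attempting a fix.
    # None (or blank) means nobody can state it yet.
    expected_behavior: Optional[str] = None


def diagnose(failure: Failure) -> str:
    """Classify a failure as 'essential' or 'accidental'.

    Assumption: if the expected behavior cannot be stated, the gap is in the
    specification (essential). If it can, the gap is in how the instruction
    was communicated or parsed (accidental).
    """
    if failure.expected_behavior is None or not failure.expected_behavior.strip():
        return "essential"
    return "accidental"


formatting = Failure(
    description="Model returns prose where JSON was requested",
    expected_behavior="Return a JSON object with keys 'action' and 'reason'",
)
ambiguous = Failure(
    description="Model picks the wrong action when the input is ambiguous",
)
```

Here `diagnose(formatting)` comes back accidental and `diagnose(ambiguous)` comes back essential; only the first is worth another pass at the prompt's structure.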
The second is to grow agents through operation, not specification. Brooks's prescription in 1986 was to stop trying to specify the system upfront and start growing it iteratively through use. Apply this to agents directly. Ship a minimal system prompt. Observe failures. Add to the prompt only the cases the failures actually reveal. The model's own production behavior is the only authoritative source of information about which essential-complexity edges actually bite in your domain. No amount of upfront prompt engineering will surface those edges; only operation will. The system prompt should be a log of every essential disambiguation you have had to make, not a predictive model of every disambiguation you might one day need.
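The idea of a system prompt as a log can be sketched as an append-only structure that is rendered into the prompt on demand. This is a minimal illustration under stated assumptions: the `SystemPrompt` and `Disambiguation` types, the `record` and `render` methods, and the numbered output format are all hypothetical, and the support-ticket wording is invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Disambiguation:
    observed_failure: str  # the situation seen in operation
    decision: str          # the human decision about what is correct there


@dataclass
class SystemPrompt:
    base: str
    log: list[Disambiguation] = field(default_factory=list)

    def record(self, observed_failure: str, decision: str) -> None:
        """Append a decision only after a real failure has revealed the need."""
        self.log.append(Disambiguation(observed_failure, decision))

    def render(self) -> str:
        """Render the minimal base prompt followed by every logged decision."""
        lines = [self.base]
        for number, entry in enumerate(self.log, start=1):
            lines.append(f"{number}. When {entry.observed_failure}: {entry.decision}")
        return "\n".join(lines)


prompt = SystemPrompt("You triage support tickets.")
prompt.record(
    "a ticket mentions both billing and a bug",
    "route to billing first",
)
```

The prompt starts as one line and grows by one rule per observed failure, so each entry traces back to something that actually happened and not to a guess about what might.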
The third is to invest in the human judgment Brooks said couldn't be replaced. The essential complexity of agent design is, by Brooks's argument, irreducibly a human-judgment problem. Quality definitions, edge-case priorities, and conflict resolution between objectives cannot be specified by any framework, because the framework would itself need someone to specify it, and that someone is the irreducible part. The right organizational move is to identify the people on your team who are good at this work and give them more of it, rather than to look for the framework that will let anyone do it. Brooks called these people great designers; the modern equivalent is the engineer who consistently writes the prompt that catches the edge case the rest of the team didn't see. Those engineers are scarce and they are not interchangeable.
The system prompt you are rewriting for the fifth time is doing the wrong thing in those five cases because no rewrite specifies the right thing well enough. The next rewrite will not change that. The version after will not change it either. What changes it is the part that has not changed since Brooks wrote it down in 1986: deciding what you want the model to do is the part of the job that is your job. The model executes. The framework distributes. The tools format. The essential complexity is yours, and no prompt will take it from you.
The essay's middle prescription — grow agents through operation, not specification — only works if the operational record is reliable enough to learn from. The failures you actually observe are the only authoritative source of information about which essential-complexity edges bite in your domain, and they only stay learnable if they are recorded with what the agent actually did at the time. Chain of Consciousness anchors every agent action to a verifiable external record so the next rewrite of the system prompt is based on what happened, not on what got reconstructed afterward.
pip install chain-of-consciousness · npm install chain-of-consciousness