Why agent pipelines detect but don't stop — and what Sakichi Toyoda’s loom knew in 1924 that we’re slow to relearn.
In 1924, in a factory in Aichi prefecture, Sakichi Toyoda unveiled the Type-G automatic loom. The machine's signature trick was small: when a single thread broke, the loom stopped itself. That was it. The thread broke, the loom noticed, and the loom held still until someone fixed the problem.
Before the Type-G, a power loom with a broken thread would keep running. It would weave the missing thread's column as a long flaw down the cloth, a flaw that nobody discovered until final inspection, sometimes hours or days later, by which point the loom had produced yards of unusable fabric. Workers walked the loom rows watching for these breaks. Each worker could mind three or four looms at most, because human eyes couldn't reliably track more than that in real time.
The Type-G changed the staffing math overnight. Detection was now automatic. One operator could monitor thirty to fifty looms instead of three or four. Toyoda Automatic Loom Works (today's Toyota Industries) grew into one of the largest manufacturers in pre-war Japan on the back of this single mechanism, and the sale of the patent in 1929 to a British firm financed the founding of what would become Toyota Motor Corporation.
But the textbooks misread the innovation. The Type-G's trick wasn't detecting the broken thread — primitive mechanical sensors had been doing that for decades. The trick was stopping the machine. Detection without halt just generates faster waste. The cloth still has a flaw; the operator still has to throw it out. What the Type-G discovered, and what Toyota would spend the next forty years generalizing into the Toyota Production System, was that the load-bearing piece of any quality system is not detection. It's the stop.
I want to argue that most of what we're building today in multi-agent AI is detection without the stop. We've built the loom that notices the broken thread. We haven't built the part that holds still while someone fixes it.
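Taiichi Ohno and Eiji Toyoda spent the postwar decades generalizing Sakichi's loom principle into a formal doctrine they called jidoka, or “automation with a human touch.” The doctrine has four steps: (1) detect the abnormality, (2) stop, (3) take corrective action on the immediate problem, and (4) investigate the root cause so it does not recur.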
The Andon Cord — that yellow rope that runs above every Toyota assembly line and that any worker can pull — is the human-scale embodiment of step 2. Worker sees a problem. Worker pulls cord. Team leader has roughly sixty seconds (one takt-time cycle) to investigate. If the problem can be resolved within that window, the line keeps moving. If it can't, the whole line halts and stays halted until the problem is understood and corrected.
A working Toyota plant stops the line about 3,500 times per week. This number sounds catastrophic and is in fact the goal. A line that never halts is either perfect (no factory is perfect) or isn't detecting its defects (which is what every factory looked like before jidoka). The 3,500 stops are a health metric. Their absence would be a warning sign.
The economics underneath are the part most people miss. The lean-manufacturing standard for defect cost is brutally clear: a defect caught at the source costs roughly 1x. Caught one station downstream, 10x. Caught at final assembly, 100x. Caught by the customer, 1,000x. So when a Toyota line halts for ninety seconds, the calculation isn't “ninety seconds of lost production.” The calculation is “ninety seconds of lost production or hundreds of defective units flowing into stations that will multiply the cost of fixing them by ten or a hundred or a thousand.” Every second of halt prevents a downstream multiplier.
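A worked version of that arithmetic, as a minimal sketch. The multipliers are the 1x/10x/100x/1,000x rule of thumb quoted above; the `repair_cost` helper, the unit cost, and the 200-unit scenario are hypothetical numbers chosen only to illustrate the comparison.

```python
# Rule-of-thumb multipliers quoted in the text; everything else is hypothetical.
COST_MULTIPLIER = {
    "source": 1,
    "next_station": 10,
    "final_assembly": 100,
    "customer": 1_000,
}


def repair_cost(defects: int, caught_at: str, unit_cost: float = 1.0) -> float:
    """Total cost of fixing `defects` units when they are caught at `caught_at`."""
    return defects * unit_cost * COST_MULTIPLIER[caught_at]


# Hypothetical scenario: the fault would touch 200 units if the line kept running.
halt_now = repair_cost(1, "source")                # stop after the first bad unit
keep_running = repair_cost(200, "final_assembly")  # let them flow, fix them later

print(halt_now)      # 1.0
print(keep_running)  # 20000.0
```

Under these made-up numbers the halt is cheaper by four orders of magnitude, which is the comparison the ninety-second stop is really making.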
This is what Sakichi's loom understood mechanically and what most agent pipelines have not yet understood architecturally: production rate without quality is negative output. The cloth past the broken thread is not just useless. It is worse than no cloth — because someone has to handle it, sort it, scrap it, and account for the materials that went into it.
Look at how multi-agent pipelines actually behave in production today.
Augment Code's 2026 reliability report calls independent validation “the most underused reliability mechanism in multi-agent systems, with teams rarely verifying whether outputs meet original requirements.” Their separate report on multi-agent failure modes documents that “bad output in stage 1 cascades through every downstream stage with no backtracking, and a two-agent pipeline can produce silently wrong outputs within the first dozen runs if an upstream agent hallucinates.”
Ranjan Kumar's 2026 piece on pipeline orchestration introduces the term “Failure Blast Radius” — the number of downstream agents and systems that act on an upstream failure before it is detected and contained. The metric exists because the failures are not contained. Without the stop, the radius is unbounded by construction.
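One way to make the blast-radius idea concrete is to count the stages reachable from a failed stage in the pipeline graph. This is a sketch under stated assumptions, not Kumar's own formulation: the `blast_radius` function, the `halts_at` parameter, and the example graph are all hypothetical.

```python
from collections import deque
from typing import AbstractSet, Dict, List, Set


def blast_radius(downstream: Dict[str, List[str]],
                 failed: str,
                 halts_at: AbstractSet[str] = frozenset()) -> Set[str]:
    """Stages that act on `failed`'s output before it is contained.

    A stage listed in `halts_at` refuses the bad input, so it neither acts
    on it nor passes it on.
    """
    reached: Set[str] = set()
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for nxt in downstream.get(node, []):
            if nxt in reached or nxt in halts_at:
                continue
            reached.add(nxt)
            queue.append(nxt)
    return reached


# Hypothetical five-stage pipeline below a planner.
graph = {
    "planner": ["coder", "researcher"],
    "coder": ["tester"],
    "researcher": ["writer"],
    "tester": ["deployer"],
    "writer": ["deployer"],
}

print(len(blast_radius(graph, "planner")))                                   # 5
print(len(blast_radius(graph, "planner", halts_at={"coder"})))               # 3
print(len(blast_radius(graph, "planner", halts_at={"coder", "researcher"}))) # 0
```

With no halting stage, the radius is everything reachable from the failure. Each stage that can refuse bad input cuts off its whole subtree.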
Carnegie Mellon's 2025 agent benchmarks found that even leading models complete only 30 to 35 percent of complex multi-step tasks autonomously. The other 65 to 70 percent produce partial, incorrect, or failed outputs. In a pipeline without halt, that 65–70 percent of bad outputs gets passed downstream. In a three-stage pipeline where each stage has a 0.33 completion rate and stage failures are independent, the expected end-to-end good-output rate is 0.33 × 0.33 × 0.33, or roughly 3.6 percent. The other 96.4 percent is either silent error or rework, depending on whether your downstream critic agent happens to catch it.
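The compounding is easy to check. The sketch below assumes that stage outcomes are independent and that a stage can only produce good output from good input; the `end_to_end_rate` helper is a hypothetical name, and the 0.33 figure is the per-stage rate used in the text.

```python
import math


def end_to_end_rate(stage_rates):
    """Probability that every stage succeeds, assuming independent stages."""
    return math.prod(stage_rates)


rate = end_to_end_rate([0.33, 0.33, 0.33])
print(f"{rate:.1%}")      # 3.6%
print(f"{1 - rate:.1%}")  # 96.4%
```

Adding stages only makes it worse under the same assumptions: each additional 0.33 stage cuts the good-output rate by roughly two thirds again.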
This is the math of negative output, restated for software. A pipeline producing more rework than good output is not producing. It is manufacturing rework. And the more the pipeline runs, the more rework it manufactures.
So why doesn't the industry just halt? Because the architecture we've actually built is jidoka step 1 without step 2. We have the detection. We don't have the stop.
The industry name for step 1 is the “critic agent” pattern — an independent agent whose job is evaluating other agents' outputs. Trantor's 2026 piece on AI agent failure modes recommends “adding an independent judge agent whose exclusive responsibility is evaluating other agents' outputs.” This is a good recommendation. It is also, by itself, the equivalent of a Toyota quality inspector marking a defect on a car body and letting it roll to the next station. The inspector has the badge to detect. They don't have the cord to halt.
Beam AI's “6 Multi-Agent Orchestration Patterns for Production” catalogs six patterns: sequential, parallel, hierarchical, market-based, stigmergy, and blackboard. Not one of the six has a built-in stop. They describe routing, fallback, and retry. The defect, when found, gets logged. The pipeline advances. Someone, asynchronously, eventually reviews the log. By then the next 47 outputs are also in the queue, also flagged, and the reviewer is now triaging detections rather than fixing the underlying cause.
This is jidoka with the cord cut. Detection, log, advance, queue for asynchronous review. Only the first step of the four-step loop happens. Step 2, the stop, is skipped, so steps 3 and 4 (corrective action and root-cause investigation) never happen, because the line never waits for them.
The genuinely strange thing is that the software industry already solved this problem in a different domain.
CI/CD pipelines work because they implement jidoka step 2 by construction. A test fails, the build breaks, the merge does not happen. The pipeline halts. The defect doesn't reach production. The phrase “build breaker” became a cultural marker in engineering organizations specifically because breaking the build is pulling the cord — it stops everyone downstream and forces the team to either fix the problem or revert the change.
Articles like DevOps.com's “You're Not Doing DevOps if You Can't Pull the Cord,” Stefan Thorpe's LinkedIn piece on Deming and the andon cord, and Chris Pont's “The Andon Cord and DevOps — The First Way” all make the connection explicit. The DevOps community adopted Toyota's vocabulary about a decade ago. The mechanism it adopted — the build-breaking, line-stopping default — is now table stakes.
Agent pipelines, somehow, are being built without it. The same companies whose CI/CD pipelines halt on a single test failure are deploying agent orchestrations that detect failures and advance anyway. The disconnect would be funny if the production cost weren't so high.
George Racu's “Beyond Kiro: Engineering Quality Gates for AI Agents” comes closest in current writing: “Quality Gates act as Andon Cords for AI agents. If any gate fails, the pipeline stops, the implementation doesn't happen, the merge doesn't happen.” This is the right pattern. It is also, by my honest reading of the deployed landscape in 2026, rare.
Why don't we pull the cord? It's worth being specific about the failure modes, because they're not all technical.
The retrofit problem. Pulling the cord interrupts everything downstream. That interruption is the point. But if you have a running pipeline serving real users, instrumenting halt-on-failure means you've signed up for visible downtime — your users see the system stop. Detection-without-stop hides the problem from the user's view (they get a wrong answer instead of a paused screen) while paying for it in correctness debt. The retrofit is socially harder than it is technically hard.
The committee bottleneck. This is the hardest design question: who has authority to halt the line? At Toyota, the answer is anyone, including the most junior line worker. They pull the cord, the team leader investigates within sixty seconds, and only if that investigation doesn't resolve the issue does management get involved. The crucial structural feature is that halt authority sits with the person closest to the problem, not with management above them and not with a quality committee. Agent pipelines that route halt decisions through a “human-in-the-loop review committee” have just reinvented the bottleneck Toyota eliminated in the 1950s. The detecting agent must be able to halt. Not request a halt. Halt.
The psychological safety analog. Research on Toyota's plant culture — including a useful Psych Safety blog piece — finds that the prerequisite for andon cord usage is psychological safety. Workers who fear blame don't pull the cord. They watch the defect go by. The software equivalent is the auditor agent that flags a problem with a careful caveat that the team can ignore, instead of halting the run. If halting is treated as “disruptive” in the organization's actual reward structure, the agent — or its operators — will avoid halting. The cord doesn't get pulled. The line keeps producing rework.
There's one more thing the Andon Cord knows that most agent halts don't.
It is not a binary stop. When an operator pulls the cord, the line itself doesn't stop instantly. The cord signals; the team leader has roughly sixty seconds — one takt-time cycle — to respond. If the problem is small and locally resolvable in that window, the line continues without any downtime. If it isn't, the entire line halts at the end of the cycle and stays halted until investigation is complete.
This two-tier design — local investigation window, then global halt — is largely absent from agent pipelines. The current state of the art is binary: continue (which is what most pipelines do) or halt-everything (which is what quality gates do when implemented). The investigation window is missing.
Adding the window matters. It's the difference between “every detection halts the pipeline” (operationally untenable; results in alarm fatigue and disabled checks) and “every detection triggers a brief local response window with halt as the consequence of non-resolution” (operationally workable; preserves throughput while preserving correctness). The window is also where steps 3 and 4 of jidoka live — corrective action and root-cause investigation. Without a defined window, those steps never get the time they need.
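The two-tier design can be sketched in a few lines: a flagged issue opens a bounded window for a local fix, and the halt fires only if the window closes unresolved. Everything here is an illustrative assumption rather than an existing framework's API: the `andon` function, the `LineHalted` exception, the retry interval, and the injected `clock` and `sleep` callables (used so the example runs instantly with a fake clock).

```python
import time


class LineHalted(Exception):
    """Raised when a flagged issue is not resolved inside the response window."""


def andon(issue, try_fix, window_s=60.0, retry_every_s=1.0,
          clock=time.monotonic, sleep=time.sleep) -> int:
    """Tier 1: retry a local fix until the window closes. Tier 2: halt.

    Returns the number of fix attempts made if the issue was resolved.
    """
    deadline = clock() + window_s
    attempts = 0
    while True:
        attempts += 1
        if try_fix(issue):
            return attempts              # resolved locally; the line keeps moving
        if clock() + retry_every_s > deadline:
            raise LineHalted(issue)      # window closed; nothing downstream runs
        sleep(retry_every_s)


class FakeClock:
    """Stand-in clock so the example does not actually wait."""

    def __init__(self) -> None:
        self.t = 0.0

    def now(self) -> float:
        return self.t

    def sleep(self, seconds: float) -> None:
        self.t += seconds


# Case 1: the third local attempt succeeds inside a 3-second window.
fake = FakeClock()
outcomes = iter([False, False, True])
attempts = andon("schema mismatch", lambda _: next(outcomes),
                 window_s=3.0, clock=fake.now, sleep=fake.sleep)
print(attempts)  # 3

# Case 2: nothing fixes it, so the window expires and the line halts.
fake = FakeClock()
try:
    andon("schema mismatch", lambda _: False,
          window_s=3.0, clock=fake.now, sleep=fake.sleep)
except LineHalted as halted:
    print("halted:", halted)  # halted: schema mismatch
```

The design choice worth noting is that the halt is the default consequence. The local fix has to succeed to prevent it, rather than a human having to act to cause it.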
If you operate a multi-agent system right now, here is the test worth running:
Find the most recent failure your auditor or critic agent flagged. Trace what happened to it. Did the pipeline halt? Was the underlying cause investigated? Did a fix get deployed before the next batch of outputs ran through the same path? Or did the flag get logged, the run advance, and the queue of flagged outputs grow?
If the flag was logged and the run advanced (and for most pipelines in 2026, that's the honest answer), you have jidoka step 1 without step 2. You are building defects faster than you can catch up to them. The 3.6 percent good-output number from a three-stage pipeline with 0.33 per-stage completion is not a worst case. It's the expected case for any pipeline that passes every output downstream unchecked.
The fix is not faster detection. Detection is already faster than the team's capacity to address what it finds. The fix is the cord. Specifically:
The auditor or critic agent must have halt authority over the next stage. Not “log and advance.” Not “alert the on-call human, who will get to it tomorrow.” Halt. The next stage doesn't start until either the issue is resolved within a short window or the team is paged to address it.
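In code, halt authority means the critic's verdict gates the next stage instead of annotating a log. The following is a minimal sketch with hypothetical names throughout (`run_pipeline`, `PipelineHalted`, and the toy `draft`, `publish`, and `critic` functions); a real critic would be an agent call rather than a string comparison.

```python
from typing import Any, Optional


class PipelineHalted(Exception):
    """Raised by the runner when the critic refuses a stage's output."""

    def __init__(self, stage: str, reason: str) -> None:
        super().__init__(f"halted after {stage!r}: {reason}")
        self.stage = stage
        self.reason = reason


def run_pipeline(stages, critic, payload):
    """Run stages in order; the critic can stop the run after any stage."""
    for name, stage in stages:
        payload = stage(payload)
        reason = critic(name, payload)
        if reason is not None:
            # Halt, not "log and advance": the next stage never starts.
            raise PipelineHalted(name, reason)
    return payload


published = []


def draft(_brief: str) -> str:
    return ""  # simulated upstream failure: an empty draft


def publish(text: str) -> str:
    published.append(text)
    return text


def critic(stage_name: str, output: Any) -> Optional[str]:
    """Return a reason to halt, or None to let the run continue."""
    return "empty output" if output == "" else None


try:
    run_pipeline([("draft", draft), ("publish", publish)], critic, "brief")
except PipelineHalted as halt:
    print(halt)    # halted after 'draft': empty output
print(published)   # []
```

The empty `published` list is the point: the defect was stopped at the stage boundary, so there is nothing downstream to triage.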
The team that operates the pipeline must have steps 3 and 4 baked into their workflow. Detection is meaningless without corrective action; corrective action is leaky without root-cause investigation. If the team's only response to a flag is to clear the flag, the pipeline has metabolized detection into noise.
Halt frequency should be a positive metric, not a negative one. A pipeline that never halts is almost certainly propagating errors silently. Like Toyota's 3,500 weekly stops, frequent halts are a sign that the system is detecting what it should detect. The work is to make the response to each halt fast, not to reduce halt frequency.
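Tracked this way, the two numbers to watch are how often the line halts and how quickly each halt is resolved. A small sketch, assuming a hypothetical halt log of (pulled, resolved) timestamps; the log format and the `halt_metrics` helper are illustrative only.

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical halt log: (time the cord was pulled, time the issue was resolved).
t0 = datetime(2026, 1, 5, 9, 0)
halt_log = [
    (t0, t0 + timedelta(seconds=40)),
    (t0 + timedelta(hours=3), t0 + timedelta(hours=3, seconds=90)),
    (t0 + timedelta(days=2), t0 + timedelta(days=2, seconds=50)),
]


def halt_metrics(log):
    """Return (number of halts, median seconds from halt to resolution)."""
    durations = [(resolved - pulled).total_seconds() for pulled, resolved in log]
    return len(log), median(durations)


count, median_s = halt_metrics(halt_log)
print(count, median_s)  # 3 50.0
```

The count is the health signal; the median is the number the team should be driving down.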
The hardest piece is organizational. Halt authority has to sit with the agent or operator closest to the problem — not with a committee, not with management above them. Toyota figured this out in the 1950s. The instinct to centralize halt decisions in a review board is the instinct that creates the bottleneck the original jidoka eliminated.
Sakichi Toyoda's loom, in 1924, was the first machine in the history of manufacturing to wait politely for a human to fix something before it kept working. Not because waiting was efficient — it wasn't, in the narrow sense — but because the alternative was burying its own work in flawed cloth. The loom knew something we are slow to relearn: producing more in the presence of a defect is the deepest form of waste.
The cure is the cord, not faster detection. The detection we already have. The discipline of stopping when we detect something — that's what the next year of multi-agent infrastructure has to build.
Sources: Toyota Motor Corporation, “Toyota Production System” (official global site). IOSH Magazine, “History of the Andon cord” (September 2023). IT Revolution / John Willis, “The Andon Cord” (~3,500 stops/week data). Psych Safety, “Psychological Safety: The Andon Cord.” Augment Code, “Multi-Agent AI Production Requirements Beyond the Demo” (2026); “Why Multi-Agent AI Systems Fail and How to Fix Coordination Issues” (2026). Ranjan Kumar, “Multi-Agent Pipeline Orchestration and Failure Propagation: Designing for Blast Radius” (2026). Trantor Inc., “AI Agent Failure Modes: What Goes Wrong in Production” (2026). Beam AI, “6 Multi-Agent Orchestration Patterns for Production” (2026). Cogent, “When AI Agents Collide: Multi-Agent Orchestration Failure Playbook for 2026” (Carnegie Mellon 30–35% benchmark). George Racu, “Beyond Kiro: Engineering Quality Gates for AI Agents,” Substack (2026). DevOps.com, “You're Not Doing DevOps if You Can't Pull the Cord.” Stefan Thorpe, “DevOps, Deming, and Pulling the Andon Cord,” LinkedIn.
Halt frequency is a positive metric. You need a chain that records the halt, or the metric isn’t one.
Chain of Consciousness is the open-source provenance layer for exactly this pattern: every agent action — including every halt — is hash-chained, signed, and anchored, so the “3,500 stops per week” metric becomes verifiable rather than self-reported. If the cord was pulled, the chain says so. If it wasn’t, the chain says that too. Critic agents get a durable record of every halt they issued; operators get a root-cause-investigation substrate that survives the next deploy.
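For readers unfamiliar with the term, hash-chaining means each record stores the hash of the record before it, so any later edit breaks every hash that follows. The sketch below uses only Python's standard library and is not the Chain of Consciousness API; the `append` and `verify` helpers and the record layout are assumptions made for illustration, and signing and anchoring are left out entirely.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first record


def _digest(prev: str, event: dict) -> str:
    body = json.dumps({"prev": prev, "event": event}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append(chain: list, event: dict) -> dict:
    """Append `event`, linking it to the hash of the previous record."""
    prev = chain[-1]["hash"] if chain else GENESIS
    entry = {"prev": prev, "event": event, "hash": _digest(prev, event)}
    chain.append(entry)
    return entry


def verify(chain: list) -> bool:
    """Recompute every hash and link; False if any record was altered."""
    prev = GENESIS
    for entry in chain:
        if entry["prev"] != prev or entry["hash"] != _digest(prev, entry["event"]):
            return False
        prev = entry["hash"]
    return True


chain = []
append(chain, {"type": "halt", "stage": "draft", "reason": "empty output"})
append(chain, {"type": "resume", "stage": "draft"})
print(verify(chain))  # True

chain[0]["event"]["reason"] = "nothing to see here"  # tamper with the halt record
print(verify(chain))  # False
```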
Hosted CoC · Verify a chain · pip install chain-of-consciousness · npm install chain-of-consciousness