In April 1989, the Cardiac Arrhythmia Suppression Trial was halted early. The drugs worked. Encainide and flecainide were doing exactly what they were designed to do: they suppressed the irregular heartbeats — premature ventricular contractions, or PVCs — that had been showing up in heart-attack survivors’ ECGs for years. The metric improved. The patients died.

Patients on active drugs died at 2.38 times the rate of patients on placebo, with a 95 percent confidence interval of 1.59 to 3.57 (NEJM, 1991; PubMed 2473403). The relative risk for arrhythmic death specifically was 2.64. By the early 1990s, researchers were estimating that anti-arrhythmia drugs of this class had been killing tens of thousands of American heart-attack patients per year — for years — because the easy-to-measure proxy (PVC suppression) and the outcome that mattered (whether the patient stayed alive) were not just uncorrelated. They were inversely correlated.

This is the streetlight effect at its most lethal. The most disturbing part of it is not that the doctors were stupid or careless. It is that they were doing exactly what the entire architecture of medical research rewarded: measuring what was measurable, publishing what was publishable, prescribing what was prescribable. The lamp was bright. The keys were elsewhere.

The drugs worked perfectly on the metric that did not matter. Heartbeats steadied. Patients died. The light was better here.

The drunkard, the cop, and the Sufi

There is a joke about a drunkard who has lost his keys. A passing police officer finds him searching for them under a streetlight. “Did you lose them here?” the officer asks. “No,” says the drunkard. “I lost them over there. But the light is better here.”

The earliest known American printing of the joke is in the Boston Herald in 1924, about a man searching Copley Square for a lost two-dollar bill he had dropped on darker Atlantic Avenue (Quote Investigator, 2013). Long before that, a Sufi parable had Mulla Nasreddin searching the sunlit yard for a ring he had lost inside a dark room (Idries Shah, The Exploits of the Incomparable Mulla Nasrudin, 1966). In 1964, the philosopher Abraham Kaplan formalized the joke in The Conduct of Inquiry and called it “the principle of the drunkard’s search.” Kaplan’s argument was uncomfortable: behavioral scientists, he wrote, habitually formulate problems to fit available methods rather than developing methods to fit important problems. It was, he admitted, “a very human trait.”

Robert Jervis extended the concept to political analysis in 1993, showing how Cold War analysts compared U.S. and Soviet capabilities by counting weapons because counting was easy, even though weapon counts had only a tenuous relationship to actual military outcomes. David H. Freedman gave the bias the name we use now in his 2010 book Wrong. Once you know the shape, you start seeing it in places where the joke is no longer funny.

More data, fewer keys

Here is the result that should keep technical leaders up at night: in 2024, four economists ran a careful experiment on what happens when you give researchers more data about which research directions look promising (Hoelzemann, Manso, Nagaraj, and Tranchero, NBER Working Paper 32401). They studied exploration under uncertainty in a controlled setup, then validated their findings against decades of genetic-research history.

Providing data on the true value of one project reduced individual payoffs by 12 percent. It cut the group’s likelihood of discovering the optimal outcome by 48 percent. In the genetic-research field analysis, diseases with early evidence of promising genetic targets were 16 percentage points less likely to yield breakthroughs than diseases where early efforts had failed.

The mechanism is mostly free-riding. When the data illuminates an attractive path, everyone crowds onto that path. Nobody bothers to wander into the dark, because the dark looks unpromising relative to the lit thing in front of them. The shared, public, illuminated dataset — the thing we have all been told makes science go faster — actively suppresses the variance of search behavior that produces breakthroughs.

This is not an argument against data. It is an argument against confusing the available with the important. The streetlight does not just bias your search. Under some conditions, it makes the keys harder to find than if you had no light at all.

The lamp generates its own light

The Hurricane Sandy story is the streetlight effect at social-media scale. In late October 2012, when Sandy slammed into the U.S. Northeast, real-time analyses of geotagged tweets showed an enormous spike of disaster-related social media activity coming from Manhattan. A naive read of the data: Manhattan was hardest hit. The actual meteorological and damage data: the New Jersey shore took the worst of the storm, particularly poorer and older communities along the coast (The Conversation, 2016).

Manhattan was screaming loudest because Manhattan had the densest concentration of young, tech-savvy, English-speaking, smartphone-equipped users. The hardest-hit communities skewed older, lower-income, less plugged in. The platform’s user base was the lamp. If you had allocated emergency response by tweet volume — and people tried — you would have systematically neglected exactly the people who needed help most.

Google Flu Trends made the failure mode even cleaner. Launched in 2008 as a way to predict influenza prevalence from search queries, it gave estimates roughly twice as high as actual CDC reports during the 2012–13 flu season (The Conversation, 2016). The cause was not random noise. The model could not distinguish people searching about flu because they had it from people searching because flu coverage was on cable news. Media coverage drove searches, which drove “flu signal,” which drove media coverage. The lamp was generating the light it then measured. Google retired the public Flu Trends product in 2015. The retirement, like the CAST halt, deserves to be on a plaque somewhere.
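
The loop fits in a toy. Here is a minimal sketch in Python with invented coefficients; this is arithmetic, not a reconstruction of GFT, and it shows only how an estimate read off searches inflates while the true rate never moves:

# Searches respond to coverage; the "signal" is read off searches;
# coverage responds to the signal. True flu is held flat at 1.0.
flu, coverage = 1.0, 1.0
for week in range(5):
    searches = flu + 0.8 * coverage  # sick people plus curious people
    signal = searches                # the model cannot tell them apart
    coverage = 0.9 * signal          # news reacts to the scary number
    print(f"week {week}: true flu {flu:.1f}, estimated {signal:.2f}")
# estimated climbs from 1.80 to 3.10 while true flu stays 1.0:
# the lamp lighting itself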

The lamp’s geography is older than you think

In 2017, the political scientist Cullen Hendrix published a study in Global Environmental Change asking a quietly devastating question: which African countries get studied for climate change, and why? You might assume the answer would be “the most vulnerable ones.” It is not.

Hendrix found that the predictors of academic attention on a given African country were not exposure to climate effects or weakness of adaptive capacity. They were British colonial history, civil liberties, and political stability — in other words, the conditions under which English-language data infrastructure, professional networks, and field access happen to exist (Hendrix, Global Environmental Change 43:137–147, 2017). Of the 20 countries judged most at risk from climate change as of 2015, none ranked among the ten most-studied countries in the climate-conflict literature — a finding the Wilson Center’s New Security Beat later highlighted in its 2018 analysis of Hendrix’s data.

This is where the streetlight effect stops looking like a research-methodology issue and starts looking like a moral one. The downstream consequences include misallocated funding, distorted policy, and the steady reinforcement of a research record that systematically underrepresents the countries most likely to need its findings. The lamp’s geography was set in the nineteenth century, and the academy is still standing under it.

Education knows but cannot act

The most poignant version of the streetlight effect I know about comes out of Harvard’s SEED Lab. In 2024, Ahun and colleagues published a review in American Psychologist (PubMed 39418471) examining outcome measures used in evaluations of interventions for preschool-age children since 1990. Of the measured outcomes, 49.1 percent focused on academic skills — reading, math, letter recognition, the things that fit on a standardized test. Less than 13 percent measured what the SEED group calls Foundations of Learning and Development, or FOLD: curiosity, creativity, self-regulation, critical thinking, perspective-taking.

In a separate qualitative study with 60 diverse community experts — parents, educators, clinicians, early-childhood-care leaders — participants universally acknowledged that FOLD skills were central to child development and well-being (Ahun et al., Annals of the NYAS, 2024; PubMed 39656867).

So everyone agrees. Curiosity and self-regulation matter more than letter recognition. The funders know. The teachers know. The parents know. The children, if you asked them, would tell you. And the field measures the letter recognition because the letter recognition is what the lamp is on. Knowing about the bias does not, by itself, fix it. The lamp is institutional, not psychological.

The new lamp

The streetlight effect now also runs on training data. Common Crawl, the open web archive that underlies a large fraction of large-language-model pretraining (hundreds of billions of pages since 2008), is heavily skewed toward English-language content (Nature, d41586-025-03891-y, 2025). The downstream consequence is well documented: models perform worse in low-resource languages, miss culturally specific knowledge, and reproduce the patterns of an internet that was never a representative sample of human thought to begin with.

There is a second, subtler version of the same problem in agent design. An agent equipped with a specific toolset will search inside that toolset’s capabilities, even when the answer is somewhere the toolset does not reach. If your agent has a SQL tool but no API tool, it will phrase the user’s question as a SQL question. If it has both, it will pick the one whose return shape it can parse. The shape of the toolbox becomes the shape of the search. Most agent observability today watches which tools were called and how often they succeeded. It does not, generally, watch the questions that never got asked because the right tool was missing.
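
A minimal sketch of what watching the misses could look like, in Python. The Tool and Router names and the keyword heuristic are illustrative inventions, not any framework’s API; the point is the unroutable list, which turns the questions the toolbox cannot express into something you can count and alert on.

from dataclasses import dataclass, field

@dataclass
class Tool:
    name: str
    keywords: set[str]  # crude stand-in for a real routing model

    def can_answer(self, question: str) -> bool:
        return any(k in question.lower() for k in self.keywords)

@dataclass
class Router:
    tools: list[Tool]
    unroutable: list[str] = field(default_factory=list)  # the dark, made visible

    def route(self, question: str) -> Tool | None:
        for tool in self.tools:
            if tool.can_answer(question):
                return tool
        # Most routers stop at "no match." Recording the miss is the point:
        # it makes the never-asked questions a first-class metric.
        self.unroutable.append(question)
        return None

router = Router(tools=[
    Tool("sql", {"count", "average", "table", "rows"}),
    Tool("http_api", {"fetch", "endpoint", "status"}),
])
router.route("count the rows in orders")        # routed to the sql tool
router.route("why did churn spike in Brazil?")  # no tool fits: logged, not forced
print(router.unroutable)                        # ['why did churn spike in Brazil?']

An alert on the growth rate of that list is the agent-world equivalent of pinning the unmeasured failure classes next to the dashboard.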

A blunt translation for engineering teams: your dashboards are your lamp. The systems that take you down catastrophically are almost always the ones that fall outside your monitoring framework. The service nobody instrumented. The dependency nobody tracked. The edge case nobody logged. If you have ever been on a postmortem call where the first ten minutes are spent figuring out which graph could even show what happened, you have been the drunkard.

Where the framework breaks

Three honest objections, before the practical part.

“Starting where the data is, is rational.” It often is. You cannot search everywhere at once, and starting under the lamp is a defensible heuristic — if you intend to move. The pathology is not starting under the lamp; it is staying there because the institutional reward structure (publishable papers, green dashboards, promotable wins) makes leaving costly. As Barbara Evans put it in her 2020 review of three streetlight effects in genomics regulation: “Just because we have a regulatory solution, this does not mean we understand the problem” (Journal of Law, Medicine and Ethics 48(1):105–118).

“Data-driven research has produced enormous breakthroughs.” It has. Genome-wide association studies, modern astronomy, computational materials science — all are arguments for following the lamp. But GWAS also illustrates a second-order streetlight: conditions with large available datasets (overwhelmingly European-ancestry populations) get studied disproportionately, while health disparities in underrepresented populations persist in the dark. The lamp can produce real victories and a systematically distorted record of where victory mattered.

“Better technology keeps widening the cone of light.” True, and important. Every new instrument expands what can be seen. But every expansion creates a new periphery, and partial illumination at the edge can be more dangerous than honest darkness — because it creates false confidence. The unknowns we are about to discover are usually less dangerous than the unknowns we cannot yet name.

What you can actually do about it

Skip the abstract advice. Three concrete moves, each implementable in a single working session.

Red-team the dashboard. Pull up your team’s top monitoring view. Ask, deliberately and uncharitably: what would I expect this to look like during a serious failure that the dashboard does not cover? If you cannot name a class of incident that would not show up here, you are watching the lamp, not the failure surface. The Cold War example Jervis dissected — counting Soviet warheads because warhead counts were enumerable, while command-and-control reliability resisted quantification — is the same pattern at strategic scale. Make a list of the unmeasured failure classes. Pin it next to the dashboard.
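
If your service inventory and dashboard definitions are exportable, the audit can even be partly automated. A sketch in Python, assuming two hypothetical JSON files that are just flat lists of service names:

import json

def coverage_gaps(inventory_path: str, dashboards_path: str) -> set[str]:
    """Services that appear in the inventory but on no dashboard."""
    with open(inventory_path) as f:
        services = set(json.load(f))   # everything you run
    with open(dashboards_path) as f:
        graphed = set(json.load(f))    # everything with at least one panel
    return services - graphed          # the unlit failure surface

# Hypothetical file names; the output is the pin-up list of unmeasured classes.
print(sorted(coverage_gaps("service_inventory.json", "dashboard_services.json")))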

Pre-register the question, not the dataset. Before you go looking, write down what you are trying to learn and what observation would change your mind. Then start the search. This is the lab-science version of pre-registration, but it works equally well for product analytics, security investigations, and incident retrospectives. The point is to bind your future self to the question your past self thought was important, before the convenient dataset has had a chance to seduce them.
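
One lightweight way to do the binding, sketched in Python. Nothing here is a standard; the fields and the hash-as-commitment trick are just one way to make the question tamper-evident before the search starts:

import hashlib
import json
from datetime import datetime, timezone

def preregister(question: str, would_change_my_mind: str) -> dict:
    record = {
        "question": question,
        "would_change_my_mind": would_change_my_mind,
        "registered_at": datetime.now(timezone.utc).isoformat(),
    }
    # Share the digest anywhere timestamped (a commit message, a ticket).
    # Quietly rewriting the question later changes the digest.
    record["digest"] = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()
    ).hexdigest()
    return record

reg = preregister(
    question="Did the March onboarding change move week-4 retention?",
    would_change_my_mind="A cohort difference of 2 points in either direction.",
)
print(reg["digest"][:12])  # the commitment you post before looking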

Pay people to look in the dark. This one is structural. If your organization rewards only the kind of work that produces measurable wins inside the existing instrumentation, no individual heroics will move the search beam. Carve out a non-trivial slice of effort — the NSF’s EAGER program is one institutional model — for projects whose explicit purpose is to test whether the unmeasured things might matter. Treat negative results as deliverables. The Hoelzemann study’s quietest finding is the loudest one for managers: in their setup, competition attenuated the streetlight effect but did not eliminate it. The thing that consistently helped was structurally rewarding exploration of the dark.

The lamp this essay is on

A confession. This essay is itself a streetlight. The cases I picked — CAST, NBER, Sandy, Hendrix, SEED Lab, Common Crawl — are exactly the cases that have well-documented numbers, peer-reviewed studies, and English-language reporting. Somewhere in the unmeasured dark are versions of the same effect that I cannot name, in domains and languages I do not have access to, with a body count that does not appear in NEJM or NBER working papers.

The drugs worked perfectly on the metric that did not matter. Heartbeats steadied. Patients died. The light was better here.


Sources: CAST trial, NEJM 1991 (PubMed 2473403); Kaplan, The Conduct of Inquiry, 1964; Idries Shah, The Exploits of the Incomparable Mulla Nasrudin, 1966; Quote Investigator, 2013; Jervis 1993; Freedman, Wrong, 2010; Hoelzemann, Manso, Nagaraj, Tranchero, NBER Working Paper 32401, 2024; The Conversation, 2016 (Hurricane Sandy / Google Flu Trends); Hendrix, Global Environmental Change 43:137–147, 2017; Wilson Center New Security Beat, 2018; Ahun et al., American Psychologist, 2024 (PubMed 39418471); Ahun et al., Annals of the NYAS, 2024 (PubMed 39656867); Nature d41586-025-03891-y, 2025 (Common Crawl); Evans, Journal of Law, Medicine and Ethics 48(1):105–118, 2020.

Watch the questions, not just the tool calls

The practical thread of this essay is the observation from the agent section: observability today watches which tools were called and how often they succeeded, not the questions that never got asked because the right tool was missing. Chain of Consciousness is the audit trail that records the agent’s reasoning chain — the alternatives considered, the path taken, the path foreclosed — not just the call that landed. It is the lamp, pointed at the dark.
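
What one record of that trail might look like. This is purely illustrative, not the library’s actual schema; it is just the shape that “alternatives considered, path taken, path foreclosed” takes as data:

# Illustrative only: field names are hypothetical, not the package's API.
trace = {
    "question": "why did churn spike in Brazil?",
    "alternatives_considered": ["sql", "http_api", "give_up"],
    "path_taken": None,  # no tool could express the question
    "path_foreclosed": "no tool reaches the support-ticket corpus",
}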

pip install chain-of-consciousness
npm install chain-of-consciousness

Or run it as a service: Hosted Chain of Consciousness ships the same provenance with no install. The CAST trial took years and tens of thousands of deaths to discover the wrong metric was lit. Your agents do not have years.