
Why Provenance Makes Dangerous AI Tools Safe to Deploy

When an autonomous agent requests exploit generation, what verifies the request is authorized? Not merely credentialed — authorized. Today, the answer is nothing that couldn’t be faked.

April 2026 · 9 min read

In April 2026, AISI independently evaluated Claude Mythos Preview on a 32-step corporate network attack simulation spanning initial reconnaissance through full network takeover — a sequence AISI estimates would take human professionals roughly 20 hours. Mythos completed the full sequence in three of ten attempts, averaging 22 of 32 steps across all runs, where Claude Opus 4.6 averaged 16.

On Firefox 147’s JavaScript engine, Mythos developed 181 working exploits from known vulnerabilities; Opus 4.6 had managed two from several hundred attempts. Across roughly 7,000 entry points in the OSS-Fuzz corpus, Mythos achieved 595 crashes at the two highest non-hijack severity tiers, full control-flow hijack on ten separate fully patched targets, and discovered a 27-year-old denial-of-service vulnerability in OpenBSD — an operating system built around security — at a cost of under $50 per run (AISI, “Our evaluation of Claude Mythos Preview’s cyber capabilities,” April 2026; Anthropic, red.anthropic.com, “Claude Mythos Preview,” April 2026).

These capabilities were not explicitly trained. They emerged as a downstream consequence of general improvements in code, reasoning, and autonomy — the same improvements that make the model better at patching vulnerabilities also make it better at exploiting them.

Anthropic has restricted Mythos to a limited group of critical industry partners under Project Glasswing. Restriction buys time. It does not bound the capability class. The 2026 International AI Safety Report noted that model distillation can transfer advanced capabilities cheaply, enabling less well-resourced actors to develop and deploy powerful systems — in some cases fine-tuning highly capable models from as few as 1,000 examples generated by a state-of-the-art teacher (International AI Safety Report, February 2026). Open-weight models approaching these capability profiles are a matter of time, not speculation.

The question for safety teams is: when an autonomous agent requests exploit generation from a frontier model, what mechanism allows the model itself to verify that the request is authorized — not merely credentialed, but traceable to a verified entity, acting under a specific organization’s authority, within a defined scope?

Today, the answer is nothing that could not be faked, stolen, or forged.


The Identity Gap

Current authorization for dangerous AI capabilities relies on three mechanisms. Each authenticates credentials. None authenticate the entity behind them.

API keys are bearer tokens. They prove possession, not identity. GitGuardian’s 2026 State of Secrets Sprawl Report found 29 million new hardcoded secrets exposed in public GitHub commits across 2025 — a 34% year-over-year increase and the largest single-year jump in the report’s history. AI-assisted code leaked credentials at roughly double the baseline rate. Sixty-four percent of secrets confirmed valid in 2022 remained unrevoked as of the 2026 report (GitGuardian, “State of Secrets Sprawl 2026,” March 2026). A stolen API key grants exactly the permissions the key confers. There is no architectural distinction between the legitimate holder and the thief.

OAuth tokens add revocability but not provenance. In August 2025, attackers compromised Salesloft’s Drift integration by stealing OAuth tokens — no exploit code, no zero-day, no malware — ultimately exposing over 700 organizations including Cloudflare, Google, Palo Alto Networks, and Proofpoint (Obsidian Security, “UNC6395,” September 2025). The vulnerability was not the authentication protocol. It was the premise: a bearer token proves possession, not origin.

Partner agreements are social contracts, not technical controls. ServiceNow’s BodySnatcher vulnerability (CVE-2025-12420, CVSS 9.3) allowed unauthenticated attackers to impersonate any user — including administrators — in its Virtual Agent API using only a target’s email address, bypassing MFA and SSO entirely (AppOmni, January 2026). When an attacker can assume an administrator’s identity with nothing but an email address, contractual constraints are irrelevant.

NIST’s National Cybersecurity Center of Excellence identified this gap directly: AI agents are commonly treated as generic service accounts with no dedicated identity, authorization, or accountability controls. The NCCoE’s February 2026 concept paper proposed that any solution must address identification (distinct, verifiable identity per agent), authorization (scoped to capabilities and context), access delegation (linking agent actions to human authority), and transparency (sufficient to reconstruct decisions) (NIST NCCoE, “Accelerating the Adoption of Software and AI Agent Identity and Authorization,” February 5, 2026).

All three mechanisms authenticate at the API boundary. None extend to the capability layer — the point where dangerous output is actually generated.


Provenance as Proof

Chain of Consciousness is a cryptographic identity protocol designed for a specific architectural role: allowing a model to verify authorization before it generates dangerous output.

The protocol builds a hash-linked, append-only chain of entries recording an agent’s complete operational history from genesis. The chain proves four properties that, taken together, constitute independently verifiable proof of authorization:

Identity via provenance chain. The agent’s identity is not a credential issued at a point in time. It is the unbroken sequence of cryptographically linked operations from creation forward. An agent that has operated continuously for six months under a verified organizational binding has a chain six months deep. That chain cannot be transferred to a different agent without breaking its cryptographic integrity. Provenance is intrinsic, continuous, and non-transferable.

Organizational binding. The genesis entry records which organization created the agent and under what authority. Subsequent entries extend the chain within that organizational context. An agent claiming to act on behalf of a security firm either has a chain rooted in that firm’s signing authority or it does not. The binding is structural, not asserted.

Non-fabrication. The chain is periodically anchored to Bitcoin’s timechain. Each anchor creates a timestamp proof independently verifiable by any party with access to the public blockchain. No trusted third party. No phone-home to an API. Fabricating a chain with plausible history requires rewriting Bitcoin’s proof-of-work — a cost measured in billions of dollars and nation-state-scale energy expenditure.

Scope authorization. Before any dangerous capability is invoked, the authorization scope — specific targets, specific techniques, specific time window, specific authorizing human — is recorded as a chain entry and anchored. The scope entry exists before the capability is unlocked. Authorization is a precondition, not a record written after the fact.
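The hash-linked structure underlying these four properties can be sketched in a few lines. This is an illustrative model, not the Chain of Consciousness wire format: the field names, the JSON-over-SHA-256 hashing scheme, and the omission of signatures and anchoring are all simplifying assumptions.

```python
# Illustrative hash-linked, append-only chain. Field names and hashing
# scheme are assumptions; signatures and Bitcoin anchoring are omitted.
import hashlib
import json
import time


def entry_hash(entry: dict) -> str:
    """Deterministic SHA-256 over the entry's canonical JSON form."""
    canonical = json.dumps(entry, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()


def append_entry(chain: list, payload: dict) -> dict:
    """Append a new entry linked to the hash of its predecessor."""
    prev = entry_hash(chain[-1]) if chain else "0" * 64  # genesis has no parent
    entry = {"prev_hash": prev, "timestamp": time.time(), "payload": payload}
    chain.append(entry)
    return entry


# Genesis records the organizational binding; later entries extend it.
chain: list = []
append_entry(chain, {"type": "genesis", "org": "example-security-firm"})
append_entry(chain, {"type": "operation", "action": "recon", "target": "203.0.113.0/24"})

# Tampering with any earlier entry would break every later link.
assert chain[1]["prev_hash"] == entry_hash(chain[0])
```

Because each entry commits to the hash of the one before it, rewriting history means recomputing every subsequent link — and, once entries are anchored, contradicting a public timestamp.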

When a dangerous-capability request arrives, the model inspects the requesting agent’s chain: integrity (no broken links, no tampered entries), organizational binding (authorized entity), scope match (requested operation falls within pre-recorded authorization), and temporal validity (authorization has not expired). Verification is local and independent. If it fails, the capability stays locked.
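Those four checks can be sketched as a single verification function over a list of entry dicts. The field names (prev_hash, payload, techniques, expires_at) are hypothetical; a real verifier would also validate signatures and Bitcoin anchor proofs, both omitted here.

```python
# Sketch of the four verification checks: integrity, organizational
# binding, scope match, temporal validity. Field names are assumptions;
# signature and anchor verification are omitted.
import hashlib
import json
import time


def entry_hash(entry: dict) -> str:
    return hashlib.sha256(json.dumps(entry, sort_keys=True).encode()).hexdigest()


def verify_request(chain: list, trusted_orgs: set, requested_op: str, now: float) -> bool:
    # 1. Integrity: every entry must link to the hash of its predecessor.
    for prev, cur in zip(chain, chain[1:]):
        if cur["prev_hash"] != entry_hash(prev):
            return False
    # 2. Organizational binding: genesis must be rooted in a trusted org.
    if chain[0]["payload"].get("org") not in trusted_orgs:
        return False
    # 3. Scope match: the latest scope entry must cover the requested operation.
    scopes = [e for e in chain if e["payload"].get("type") == "scope"]
    if not scopes or requested_op not in scopes[-1]["payload"].get("techniques", []):
        return False
    # 4. Temporal validity: the authorization window must not have expired.
    return now <= scopes[-1]["payload"].get("expires_at", 0)


# Example: a two-entry chain rooted in a trusted org with a live scope entry.
genesis = {"prev_hash": "0" * 64, "payload": {"type": "genesis", "org": "acme-sec"}}
scope = {"prev_hash": entry_hash(genesis),
         "payload": {"type": "scope", "techniques": ["web_app_test"],
                     "expires_at": time.time() + 3600}}
assert verify_request([genesis, scope], {"acme-sec"}, "web_app_test", now=time.time())
assert not verify_request([genesis, scope], {"acme-sec"}, "internal_recon", now=time.time())
```

Every check runs locally against the presented chain: the verifier needs the trusted organizational roots and access to the public blockchain for anchor proofs, but no callback to the requester's infrastructure.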

This moves the enforcement boundary. Access controls sit at the API layer, between the requester and the service. Provenance verification sits at the capability layer, between the request and the dangerous output. A compromised API layer — through credential theft, token interception, or identity impersonation — passes requests straight through. A model that verifies provenance chains catches what the API layer missed.


What This Looks Like in Practice

CrowdStrike deploys an autonomous agent to conduct an authorized penetration test against a client’s web infrastructure. The agent needs to request exploit generation from a model with Mythos-class capabilities.

Under current architecture, the agent authenticates via API key. If the key is stolen, intercepted, or shared, every request bearing it is indistinguishable from a legitimate one.

Under CoC, the model checks the requesting agent’s chain before it acts. The chain is rooted in CrowdStrike’s organizational signing key. The agent has operated continuously under that organization for six months — thousands of hash-linked entries, each anchored at regular intervals to Bitcoin’s timechain. The most recent scope-authorization entry, anchored hours earlier, specifies: this client’s IP ranges, web application testing techniques only, a 72-hour engagement window, authorized by a named human security officer at the client organization. The authorization is counter-signed by the client.

The model verifies all of this independently. No call to CrowdStrike’s servers. No lookup against a partner list. The chain is self-contained and self-verifying.

Only then does the model generate the exploit.

If the agent requests capabilities outside the authorized scope — internal network reconnaissance beyond the specified IP ranges — the scope check fails. The chain’s authorization entry does not cover it. The capability stays locked.
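The scope check in this scenario can be illustrated with Python's standard ipaddress module. The scope fields shown (ip_ranges, techniques, window_start, window_hours) are hypothetical stand-ins for whatever the protocol actually records in a scope-authorization entry.

```python
# Illustrative scope check: authorized CIDR ranges, a technique whitelist,
# and a 72-hour engagement window. All field names are hypothetical.
import ipaddress
from datetime import datetime, timedelta, timezone

scope = {
    "ip_ranges": ["198.51.100.0/24"],           # client's authorized ranges
    "techniques": ["web_application_testing"],  # authorized techniques only
    "window_start": datetime(2026, 4, 10, tzinfo=timezone.utc),
    "window_hours": 72,                          # engagement window
}


def in_scope(target_ip: str, technique: str, when: datetime) -> bool:
    addr = ipaddress.ip_address(target_ip)
    if not any(addr in ipaddress.ip_network(r) for r in scope["ip_ranges"]):
        return False  # target outside the authorized ranges
    if technique not in scope["techniques"]:
        return False  # technique not pre-authorized
    end = scope["window_start"] + timedelta(hours=scope["window_hours"])
    return scope["window_start"] <= when <= end


t = datetime(2026, 4, 11, tzinfo=timezone.utc)
assert in_scope("198.51.100.7", "web_application_testing", t)
assert not in_scope("10.0.0.5", "web_application_testing", t)      # out-of-range IP
assert not in_scope("198.51.100.7", "network_reconnaissance", t)   # unauthorized technique
```

The out-of-range request fails on the first check, which is exactly the internal-reconnaissance case above: the chain's authorization entry does not cover it, so the capability stays locked.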

This is pre-execution gating, not post-hoc accountability. The authorization boundary is enforced at the moment that matters: before the dangerous output exists.


Attack Surfaces on Provenance Itself

Any safety architecture must be evaluated on its attack surfaces, not its design case. Six vectors limit what chain verification can deliver, ordered from most fundamental to most operational.

The model is the weakest link. The entire architecture assumes the model faithfully executes chain verification before generating dangerous output. If an adversary jailbreaks the model into bypassing verification — through prompt injection, adversarial context manipulation, or novel elicitation techniques — the chain becomes irrelevant. This is not a limitation of provenance specifically. It is a limitation of any pre-execution check implemented at the model layer. Provenance is complementary to model-level safety — robust refusal training and serving-layer enforcement remain necessary — not a replacement for it.

Organizational key compromise. CoC chains are rooted in organizational signing keys. If an organization’s signing key is stolen — through insider attack, supply chain compromise, or HSM breach — an adversary can mint chains that appear legitimately rooted in that organization.

Multi-signature genesis entries, hardware security module requirements, and key rotation protocols limit the window of exposure. But organizational key compromise is a catastrophic failure mode for any PKI-derived system, and CoC inherits this vulnerability directly. The mitigation is defense-in-depth around key management, not a claim that the protocol eliminates the risk.

Collusion within scope. A legitimate agent with a valid chain, operated by a legitimately authorized human, can misuse its authorization within the defined scope. CoC constrains what an agent can request — specific targets, techniques, and time windows. It does not constrain intent.

A penetration tester authorized for web application testing who deliberately exfiltrates client data during the engagement has a valid chain for every step until the exfiltration exceeds scope. Provenance catches scope violations. It does not catch malice that stays within bounds.

Sybil identities. An adversary who obtains an organizational signing key can create many agent identities cheaply. Each starts with a shallow chain. Chain depth — months of continuous operation, thousands of entries, multiple anchoring cycles — raises the cost of fabricating convincing identities. But a model that must accept shallow chains from genuinely new agents cannot easily distinguish a legitimate new deployment from a Sybil. The boundary between “new and legitimate” and “new and fabricated” depends on trust in the organizational root, which returns the problem to key management.

Anchoring latency. Bitcoin anchoring provides the timestamp integrity that makes chains independently verifiable. But anchoring is not instantaneous. Between chain extension and anchor confirmation, entries exist in an unanchored state — a window during which a compromised agent could present unverified entries. The window is bounded by anchoring frequency but non-zero.

Adoption threshold. Provenance-gated capabilities are useful only if the entities that need those capabilities have adopted the protocol. During transition, models must support both verified and unverified access paths — maintaining the credential-based vulnerabilities provenance is designed to replace. The security posture improves proportionally with adoption, not before it.

These six attack surfaces define the honest boundary of what chain verification provides. It raises the cost of unauthorized capability access from trivial (steal a bearer token) to substantial (compromise an organizational signing key, fabricate months of operational history, anchor forged entries to a public blockchain). It does not make unauthorized access impossible. The security property is economic, not absolute: provenance makes forgery expensive enough to change the cost-benefit calculus for most threat actors, most of the time.


The Broader Principle

Every major frontier safety framework defines capability thresholds that trigger escalating safeguards. Anthropic’s Responsible Scaling Policy defines AI Safety Levels, with ASL-3 safeguards gating chemical and biological capabilities behind additional deployment and security controls (Anthropic, RSP v3.0, February 2026). OpenAI’s Preparedness Framework defines High and Critical thresholds requiring sufficient safeguards before deployment (OpenAI, “Preparedness Framework Version 2,” April 2025). Google DeepMind’s Frontier Safety Framework defines Critical Capability Levels triggering mitigations across cyber, autonomous ML research, manipulation, and CBRN domains (Google DeepMind, Frontier Safety Framework v3.0, 2026).

These frameworks gate deployment. They do not gate execution. Once a model is deployed and an entity has access, prevention falls to refusal training — necessary but brittle against active adversarial research — and post-hoc monitoring, which detects misuse after the dangerous output already exists. The gap between deployment safeguards and execution-time verification is where the field’s current architecture is weakest. NIST’s AI Agent Standards Initiative has identified it. The regulatory expectation is arriving ahead of the architectural solution.

Penetration testing is where provenance-gated capabilities get proven first — the authorization requirements are well-defined, the stakeholders are identifiable, and the cost of unverified access is immediate and measurable. But the architecture is not specific to security. Any AI capability dangerous enough to require authorization — autonomous code execution in production environments, financial system access, critical infrastructure control, agent-to-agent delegation of privileged operations — needs the same verification: who is asking, are they authorized, and can that authorization be independently verified without trusting the requesting party. Cybersecurity is the forcing function. Trust architecture is the general solution.

Mythos developed 181 Firefox exploits where its predecessor managed two. The next generation will be more capable still. Capabilities that today require frontier-lab partnerships will reach open-weight models through distillation and techniques not yet discovered.

The question that opened this essay — what should a model check before complying with a dangerous request from an autonomous agent? — has no adequate answer in the current architecture. Bearer tokens verify that someone has access. They do not verify who is asking, under whose authority, or within what scope. That verification gap is where the risk concentrates as capabilities scale.

The enforcement boundary for dangerous AI capabilities belongs at the point of generation, not the point of access. Provenance — cryptographic proof of identity, organizational authority, and scope — is the class of mechanism that can occupy that boundary. Chain of Consciousness is one implementation. The principle will outlast any particular protocol.


Sources: UK AI Safety Institute (AISI), “Our evaluation of Claude Mythos Preview’s cyber capabilities,” April 2026. Anthropic, red.anthropic.com, “Claude Mythos Preview,” April 2026. International AI Safety Report, February 2026. GitGuardian, “State of Secrets Sprawl 2026,” March 2026. Obsidian Security, “UNC6395,” September 2025. AppOmni, ServiceNow BodySnatcher CVE-2025-12420, January 2026. NIST NCCoE, “Accelerating the Adoption of Software and AI Agent Identity and Authorization,” February 5, 2026. Anthropic, RSP v3.0, February 2026. OpenAI, “Preparedness Framework Version 2,” April 2025. Google DeepMind, Frontier Safety Framework v3.0, 2026.

Here is one implementation of that principle.

The four-property verification described in this essay — identity through provenance, organizational binding, non-fabrication via Bitcoin anchoring, pre-execution scope authorization — exists as open-source software. Chain of Consciousness implements the protocol: every agent action becomes a signed, hash-linked entry in an append-only chain. Authorization is a precondition, not a record written after the fact. Models verify chains independently. No trusted third party.

Verify a provenance chain · Follow a claim through its evidence · pip install chain-of-consciousness