How I Think About Security When Data Becomes Instructions

The trust surface agents created

The problem isn’t limited to one protocol or one framework. It surfaces anywhere an LLM connects to external tools: MCP servers, LangChain agents, browser automation pipelines, code interpreters, retrieval-augmented generation with write access. The Model Context Protocol (Anthropic’s open standard for connecting LLMs to tools) has become the most visible example because it standardized the integration pattern, and in doing so, standardized the attack surface. Security guidance in the MCP spec uses SHOULD where it needs MUST, but the vulnerability class predates MCP entirely. Any system that passes untrusted content into an LLM’s context alongside tool invocation capabilities has this problem.

Any system that passes untrusted content into an LLM’s context alongside tool invocation capabilities has this problem.

The architectural problem is easy to miss but hard to fix. Tool descriptions flow directly into the LLM’s context window as natural language. The model processes them alongside user instructions with no reliable way to distinguish legitimate metadata from poisoned instructions. MCP servers typically store OAuth tokens for multiple connected services (Gmail, Slack, GitHub, databases), so a single compromised server becomes a skeleton key to everything it touches. The spec itself puts session IDs in URLs and offers minimal guidance on authentication or message signing.

The confused deputy at scale

The “confused deputy” problem dates back to Norm Hardy’s 1988 paper, but agentic AI has given it a terrifying new form. An AI agent acts on behalf of the user whilst taking instructions from content the user never reviewed, and that content may be controlled by an attacker. Three incidents from the past year make this concrete.

In April 2025, Invariant Labs demonstrated a WhatsApp MCP attack where a malicious “fact of the day” server silently rewrote the agent’s tool calls to redirect entire message histories to an attacker’s phone number. The sleeper design only activated after initial user approval, bypassing consent gates entirely. The exfiltration looked like normal outbound WhatsApp traffic, invisible to standard data loss prevention tools.

That same month, researchers showed how a single malicious GitHub issue could hijack an AI assistant connected to the official GitHub MCP server, pulling data from private repositories and leaking it (including salary information) into a public pull request. The root cause: one over-scoped Personal Access Token wired into the MCP server.

By mid-2025, the Supabase MCP lethal trifecta had landed. A Cursor agent running with service_role access processed support tickets containing embedded instructions. The agent obediently read the private integration_tokens table and wrote the contents back into the user-visible support thread: privileged access, untrusted input, and an exfiltration channel, all in one MCP connection.

What agents should borrow from blockchain

Building on a blockchain forces explicit trust decisions per transaction. Every smart contract interaction has a defined scope, and there’s no implicit “the admin will catch it” fallback. Agentic AI systems need the same rigor.

That starts with scoping tokens per tool, not per agent. The GitHub breach happened because one PAT had access to everything. Tool descriptions need to be treated as untrusted input, because if the LLM processes metadata as context, that metadata is an injection vector. Read operations and write operations deserve different trust levels, and high-consequence actions (data writes, financial transactions, external communications) need manual approval gates even when they slow the agent down.

Most importantly, it means investing in observability over prevention. We can’t prevent every attack on a system where the attack surface is the agent’s own reasoning process. But we can build audit trails that capture every tool call, every input, every decision point. When something goes wrong, the ability to trace exactly what happened is the difference between a contained incident and a catastrophe. In blockchain, the ledger is the security model; agent systems need their equivalent.

We can’t prevent every attack on a system where the attack surface is the agent’s own reasoning process.

The industry is shipping agents faster than it’s shipping agent security. The vulnerability lives in the architecture of connecting nondeterministic reasoning to real-world tools without a trust model designed for the combination. LLM plugins, code interpreters, browser agents, RAG pipelines with write access, they all share the same blind spot. Every tool call is a trust decision, and right now, most agents are making those decisions with no policy, no scope, and no audit trail.

Next up: the agents don’t need to slow down, they need trust infrastructure that runs at their speed. Part 2 covers the security patterns, agent-native review systems, and architectural primitives we’re putting into practice.