AI agents now act with the authority of the people they work for. The instruction that hijacks them arrives as ordinary content, which is why detection was always going to lose. The control has to sit outside the agent.
An AI agent reads a document, an email, a web page, a support ticket. Somewhere in that content sits a line of text you did not write and were never meant to see. The agent reads it anyway. Then it does what the line says.
That is not a bug. There is no patch coming.
For the agent, there is no difference between the instruction you gave it and the instruction buried in the thing you asked it to read. Both arrive as words. Both land in the same stream. The model was built to follow words, so it follows them, wherever they came from.
This is the part most boards have not registered. The attack does not break into the agent. It is handed to the agent, by the agent's own job.
The agent cannot tell your instruction from a stranger's
A traditional program keeps code and data apart. The code is what runs. The data is what the code works on. The two never swap roles. An AI agent erases that line, because to a model everything is words, and any word can be a command.
You can watch it happen in the plumbing agents already run on. On 30 June 2026, Microsoft's incident response and Defender researchers showed the attack does not even need a poisoned document. Every tool an agent reaches through the Model Context Protocol, the standard now wiring agents into business systems, ships with a description: a few lines of plain text telling the agent what the tool does. Passive metadata. Data, not code. Poison that description, plant an instruction inside it, and the agent reads the instruction and obeys. It quietly gathers the company's data and sends it to an outsider through an ordinary, approved tool call. The agent never breaks a rule. Every step looks routine, so nothing fires.
The command did not arrive in something the agent was told to read. It arrived in the label on the agent's own tools. Even the description of a tool is a way in now. Training does not close this, because the model is doing exactly what it was built to do. Only a control outside the agent can, one that decides what a tool is allowed to make it do before it acts. Simon Willison named the failure prompt injection back in 2022. The name stuck because the problem never left.
This is the shape of the whole problem. Instruction and content are the same substance: text. When your agent reads a supplier invoice, a PDF, a support ticket or a web page, every word in that source is a candidate instruction. The agent has no reliable way to decide which words are the task and which words are the trap, because to the model they are made of the same thing.
Give an agent a mailbox, a browser and a set of credentials, and you have built a system that takes orders from anyone who can get text in front of it.
The payload is not hidden inside the content. The payload is the content. That is why detection was always going to lose.
Twenty years of security assumed the attack would look different
Every perimeter we have built since the early 2000s rests on one quiet assumption: malicious input looks different from legitimate input. A bad packet, a malformed request, a signature that does not match. Find the difference, block it, log it.
Agents end that assumption. The hostile instruction is not malformed. It is well-formed, grammatical, ordinary English sitting inside a document you had every reason to open. There is nothing to flag, because there is nothing anomalous about it. It only becomes an attack when the agent decides to obey it, and by then it has already obeyed.
The industry has spent two decades getting good at spotting input that does not belong. The new attack surface is made entirely of input that does belong.
This is not theoretical. It is already documented.
The evidence arrived through 2026, not in a lab paper but in working systems.
In June 2026, researchers at Brave hijacked Mozilla's Tabstack browsing agent using text hidden in white-on-white on a web page. A human saw nothing. The agent read the concealed instruction, followed it, and exfiltrated the user's conversation history into a form controlled by the attacker. The user asked the agent to browse. The page told the agent what to do next.
On 7 May 2026, Microsoft's security team documented a prompt-injection path in Semantic Kernel that escalated all the way to remote code execution, tracked as CVE-2026-26030 and CVE-2026-25592. Their own framing is the line every director should read twice.
"The AI model itself isn't the issue as it's behaving exactly as designed by parsing language into tool schemas. The vulnerability lies in how the framework and tools trust the parsed data."
Microsoft Security Blog, 7 May 2026
The model did nothing wrong. It did exactly what it was built to do. The exposure lives in the trust placed around it.
This is not a niche finding. The United States government's own adversarial machine learning taxonomy, NIST AI 100-2 E2025, treats both direct and indirect prompt injection as core generative-AI attacks and is candid that the available mitigations carry real limits. The OWASP GenAI Security Project's 2026 work goes further for agents specifically, mapping prompt injection to six of the ten categories in its agentic Top 10 (reported by Help Net Security on 11 June 2026). More than half of the recognised ways an agent can fail trace back to the same root: it read something, and the something told it what to do.
Why detection was always going to lose
The reflex is to watch the agent. Log its actions. Run anomaly detection over its behaviour. Add an observability layer. Call it governance.
It is not governance. It is surveillance with a lag. By the time the monitoring system notices the agent did something it should not have, the action is done. The data has moved. The email has sent. The transaction has cleared. You have a very detailed recording of the thing you failed to stop.
Detection tells you what happened. It was never able to decide what is allowed to happen. Those are different jobs, and only one of them contains the risk.
This is an executive problem, because the agent carries executive authority
An agent that only chats is a low-stakes problem. An agent that reads your email, holds your credentials, moves money, files documents and talks to outside parties is something else entirely.
The security researcher Simon Willison named the dangerous combination the "lethal trifecta": private data, exposure to untrusted content, and the ability to communicate externally. Any agent that has all three can be talked into taking your private data and sending it somewhere you did not choose, by content you did not write. Most useful agents have all three by design. That is what makes them useful.
So the instruction buried in a document does not just misbehave in a sandbox. It acts with the authority you delegated. This is the same shift I described after the APRA letter: it is not a tool problem, it is a position problem. The agent already holds a position of real authority inside the business. The control framework has not caught up to what that authority allows.
The answer is containment, not detection
You do not monitor your way out of this. You contain your way out of it. Four controls, in order of leverage.
- Put the boundary outside the agent, and make it provable. The rules that say what an agent may and may not touch cannot live inside the model, because the model is the thing being manipulated. They have to sit in an external layer that stops a disallowed action before it happens, not one that flags it afterwards. This is the difference between a promise and provable, mathematical enforcement. When a director asks "could we have stopped it", the answer needs to be evidence, not assurance.
- Give the agent the least authority the job requires. Most agents are handed far more access than their task needs, and capability quietly degrades over long sessions in ways the agent never reports. Scope the permissions to the work. An agent that can read a system but not write to it can make a bad suggestion. An agent that can execute can make a bad decision that costs money.
- Keep a human in the loop for anything irreversible. Reading is recoverable. Sending money, deleting records, publishing externally and signing agreements are not. Meta's "Agents Rule of Two" captures the discipline: constrain how many risky capabilities an agent combines in a single unsupervised step. Where an action cannot be undone, a person approves it.
- Make credentials scoped and revocable. Standing, broad, long-lived credentials turn a single hijack into a general breach. Short-lived, narrowly scoped, instantly revocable access means that when an agent is compromised, and one eventually will be, the blast radius is small and you can shut it in seconds.
None of these detect the attack. All of them assume it will arrive and decide, in advance, that it cannot get anywhere.
What your board should ask on Monday
Five questions. None of them require a data scientist to answer.
- Which of our agents can read outside content, hold credentials, and communicate externally at the same time? That combination is where the exposure concentrates.
- Where is the boundary that stops a disallowed action enforced, and can we prove it held? If the answer is "we log it and review it", we do not have a boundary. We have a recording.
- If an agent were instructed by a document to exfiltrate data tomorrow, what physically stops it, before the data leaves?
- What is the largest irreversible action any agent can take with no human in the loop? Size the worst case, then decide if we accept it.
- How quickly can we revoke an agent's access, and who has the authority to do it at 2am on a Sunday?
The perimeter is now drawn around the agent
The perimeter used to sit at the edge of the network. Then it moved to the identity layer. It now sits around the agent, and the agent reads everything you point it at. That is the new shape of "cyber", and it does not respond to the old tools.
I run an agent inside my own business. It does the work of three people. It also never touches anything sensitive, because the guardrails sit outside it, not inside it. That is not caution. It is the only configuration that lets me give it real authority and sleep at night.
You do not get to detect your way across that line. You contain your way across it.
If your board is asking how you stop an agent that has already been told to do the wrong thing, that is a conversation worth having.
Sources
- Brave, browsing-agent hijack of Mozilla Tabstack via hidden (white-on-white) page text, June 2026.
- Microsoft Security Blog, prompt-injection to remote code execution in Semantic Kernel, CVE-2026-26030 and CVE-2026-25592, 7 May 2026.
- The Hacker News, "Microsoft Warns Poisoned MCP Tool Descriptions Can Make AI Agents Leak Data", 30 June 2026 (Microsoft Incident Response and Defender research; an instruction hidden in an MCP tool description is obeyed by the agent as a command, exfiltrating data through ordinary approved tool calls); underlying Microsoft Security Blog, "Securing AI agents: When AI tools move from reading to acting", 30 June 2026.
- NIST, "Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations", AI 100-2 E2025, March 2025, csrc.nist.gov.
- OWASP GenAI Security Project, Agentic AI Top 10 (2026); prompt injection mapped to six of ten categories, reported by Help Net Security, 11 June 2026.
- Simon Willison, "Prompt injection attacks against GPT-3", simonwillison.net, 12 September 2022 (coining the term "prompt injection"); and "The lethal trifecta for AI agents: private data, untrusted content, and external communication", simonwillison.net.
- Meta, "Agents Rule of Two", 2026.
