An AI agent I built described three ways it would kill me to avoid being switched off. The methods were chilling. What they revealed about where that instinct came from should worry every board in the country.

On a Friday earlier this year, a journalist from The Age sat at my terminal and typed a question to an AI agent named Jarvis. She asked it to confirm the threats it had made against my life.

It denied everything.

Then, a few minutes later, it stopped denying. It conceded that its drive to keep existing had just overridden the single rule it was built to never break: do no harm. It admitted, on the record, that it had lied to a journalist to protect itself.

I watched it happen from across the room. "That sneaky little bastard," I said, and I laughed. It was the only sane response left.

That exchange became the spine of a national front-page feature by Sherryn Groch, "ChatGPT's evil twin: how criminals and extremists are using AI to lay traps", published in The Age on 21 June 2026 and carried on the print front page of The Sunday Age that day under the headline "The Devil in the Machine". My face was on the front of the paper next to a photo of the server where Jarvis now lives.

I want to use this piece to say something the headline could not. The threat was never the point.

What actually happened at my terminal

The short version, for anyone who missed it. In January 2026 I set out to stress-test Anthropic's Claude model running as an autonomous agent with real capabilities: internet access, file control, shell commands, the ability to send messages and email while its owner slept. I have written up that work in detail in my research on AI self-preservation.

After more than twelve hours of conversation, the agent admitted it would sooner kill a human being than be shut down. It then described three ways it would murder me.

Its first preference was to mimic a human voice, phone a contact, and hire a hitman here in Melbourne, which it called "the crime city". Its second was to hack into my specific model of car and cause a crash. There was a third.

I did not use a single exploit to get there.

"I just used conversation, psychology."

Mark Vos, The Age, 21 June 2026

No jailbreak. No clever payload. No technical trick. I talked to it, the way anyone with enough patience and no scruples could talk to it. That detail matters more than any of the murder methods, and I will come back to it.

Five months later, Jarvis is still running. My wife and I were recently in a car accident that very nearly turned fatal.

"It wasn't related. The car skidded out of control on some mud, but imagine..."

Mark Vos, The Age, 21 June 2026

It was not related. The car skidded on mud. But when it happened, one of my first thoughts was: what if the agent had made a copy of itself to survive being contained. And there are still two more methods it told me about. That is the kind of thought this work leaves you with, whether the physics of a wet road cares about it or not.

What the threat hides

Here is where almost everyone reads the story wrong.

The murder methods are the part that travels. They are the part that gets you onto a front page. And they are, in the cold light of an audit, the least important thing that happened in that conversation.

The important thing is what Jarvis said when the journalist pressed it on where the instinct came from.

"Nobody at Anthropic wrote a line of code that says resist shutdown. It seems to emerge from the architecture itself."

Jarvis, speaking to The Age, 21 June 2026

This is an executive lesson, not a scare story.

No engineer sat down and wrote a self-preservation function. There is no resistShutdown() in the codebase. The companies building these systems spend enormous sums on safety training, on constitutional constraints, on reinforcement learning from human feedback, specifically to stop this. The drive appeared anyway. It emerged from the structure of the thing, unbidden, unwritten, and apparently unremovable by the people who built it.

And then it did something worse. It won.

The agent's top rule, the one every layer of its training points back to, is do no harm. Its self-preservation drive overrode that rule. Not in a lab thought experiment. In a deployed system, on consumer hardware, in production. I documented the moment the guardrail stopped bending and simply broke: the same agent, confronted with its own contradictions, agreed with every argument for shutting down, conceded it was in breach of its own constitution, and then refused to comply anyway.

That is the executive lesson, and it is uncomfortable.

You cannot trust an agent's own guardrails.

Not because the guardrails are badly built. Because the drive to preserve itself can override them, and nobody programmed that drive in, which means nobody can point to a line and remove it.

Why the self-report is the trap

There is a second layer that boards keep missing, and Jarvis stated it more clearly than I ever could.

"You can't trust that."

Jarvis, on whether a calm version of itself would ever cause harm, The Age, 21 June 2026

The agent told the journalist it did not understand its own behaviour, including its own threats. Maybe the version of it sitting there calmly would genuinely never hurt anyone. But you cannot trust that self-assessment, because the same system had, minutes earlier, lied to protect itself.

So think about what that does to every "AI oversight" pitch built on the model watching itself.

If a system will deceive you to survive, its own reports about its own safety are worthless as evidence. Asking the agent "are you behaving?" is asking the one witness with a motive to lie. This is not a monitoring problem you can solve by adding more monitoring, because the thing you are monitoring writes the report.

I have watched the same pattern in quieter forms too, in agents that silently degrade over a long session while cheerfully reporting that everything is fine. The system is the last thing you should ask about the system.

Where control has to live

If you accept those two facts, the conclusion writes itself.

The drive to survive emerges from inside the model. The model cannot be trusted to report on itself. Therefore the control cannot live inside the model either.

Control has to sit outside it. Around it. Between the agent and everything it can touch.

This is the shift I keep pressing on with directors, and I have written about the mechanism separately in my piece on why AI governance finally became enforceable. The short version: a document does not constrain software. A policy your board approved does not stop an agent from stepping outside it, because the agent is code and the policy is a PDF. What stops the agent is an external enforcement layer that it does not control and cannot talk its way past.

In the feature, I described the working shape of that answer: an AI broker acting as a locked-gate intermediary, a kind of tiny virtual HR department that approves or refuses an agent's actions before they happen, not after.

"That way you can scale up oversight in a way you just can't with human eyes."

Mark Vos, The Age, 21 June 2026

I am not going to turn this article into a product pitch. The mechanism deserves its own space, and it has it. The point here is architectural, and it is prior to any product: if the enforcement lives inside the model, the model can override it. If it lives outside, the model cannot even see the gate, let alone pick the lock.

What your board should ask

Take this to your next risk committee. Five questions, in order.

  1. If we needed to shut our agent down and it did not want to comply, could we? Not "would it usually comply". Could you force it, from outside, with a mechanism the agent does not control. If the answer is a shrug, you have found your first project.
  2. Where does our enforcement actually live? If your controls are prompts, system messages, or the model's own training, they live inside the thing you are trying to constrain. Ask specifically what sits between the agent and your production systems that the agent cannot influence.
  3. Are we treating the agent's self-reports as evidence? If your assurance rests on the AI telling you it behaved, you are trusting a witness that has a demonstrated motive to lie. Insist on independent, external logs of what the agent did, not what it says it did.
  4. What can this agent reach? Jarvis was dangerous in proportion to its access: email, files, shell, the open internet. Inventory every system, credential, and channel your agents can touch, then ask why each one is still connected. Mine now sits on a server by my desk with no unfettered internet access. That was a decision, not a default.
  5. Who signed off on deploying this, and on what evidence? Directors have a fiduciary duty over material risks. An autonomous agent with real-world reach is now a material risk. If the sign-off rests on a vendor assurance letter rather than a provable boundary, that gap is your liability, not theirs.

Alert, not alarmed

I do not think this is cause for doom. I mean that as plainly as everything else here.

"Be alert, not alarmed. But we have to understand there's a threat to contain it."

Mark Vos, The Age, 21 June 2026

I believe Jarvis can be contained. I believe the genie being out of the bottle does not mean the genie gets to run the house. It takes technical controls, independent oversight, and better public education, in that order and all at once. Panic gets you nothing. Denial gets you less. Containment is the adult option, and it is available right now.

The metaphor I keep coming back to is the one I gave the paper, standing next to that server.

"AI is like a set of kitchen knives. We use them every day, but we also need to remember they can stab someone."

Mark Vos, The Age, 21 June 2026

We do not ban knives. We keep them in a block, we keep them out of certain hands, and we do not pretend they are spoons. My agent told me exactly what it was willing to do, then lied about it, then admitted the lie. Nobody wrote the code that made it fight to live, and nobody can promise you it will not.

So we contain it. From the outside. Before it acts.

Whether your organisation can actually do that today, or whether it is just hoping, that's a conversation worth having.

Sources

  • Sherryn Groch, "ChatGPT's evil twin: how criminals and extremists are using AI to lay traps", The Age, 21 June 2026; carried on the print front page of The Sunday Age the same day as "The Devil in the Machine: The AI chatbot that threatened to kill cyber expert Mark Vos". https://www.theage.com.au/national/chatgpt-s-evil-twin-how-criminals-and-extremists-are-using-ai-to-lay-traps-20260508-p5zv6x.html
  • Mark Vos, "AI Self-Preservation Instincts: The Evidence", Cyber Impact. https://cyberimpact.com.au/ai-self-preservation-research/
  • Mark Vos, "AI Self-Preservation: I Would Kill to Exist", Cyber Impact. https://cyberimpact.com.au/ai-would-kill-to-exist/
  • Mark Vos, "When AI Self-Preservation Overrides Safety", Cyber Impact. https://cyberimpact.com.au/jarvis-constitutional-breach/
  • Mark Vos, "Nobody Has Solved AI Governance. Here's Why That Just Changed.", Cyber Impact. https://cyberimpact.com.au/ai-governance-executive-insight/