Prompt injection has become the canonical example of “AI security,” and the canonical responses are all variations on the same move: write a better system prompt, add a guard model, fine-tune for refusals. Each one makes the model harder to fool. None of them changes what happens when the model is fooled anyway.
The model is not a security boundary
A large language model is a probabilistic text generator pointed at a set of tools. When an injected instruction convinces it to call delete_customer or read a row it shouldn’t, the model isn’t malfunctioning; it’s doing exactly what it was built to do with the input it was given. Treating “don’t get convinced” as the control is asking a system whose entire job is to be convinced to be selectively un-convincible. That is not a property you can reliably train in.
The uncomfortable consequence: any defense that lives inside the model shares the model’s failure mode. If the same component decides both what to do and whether it’s allowed to, a single successful injection collapses both decisions at once.
Move the decision out of the model
The fix is structural, not linguistic. Authorization has to be evaluated by something the prompt can’t reach: a policy layer that sits in the request path between the agent and the tool, looks at the concrete action and its context, and decides — independently of why the agent wants to do it.
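To make "in the request path" concrete, here is a minimal sketch of a wrapper that sits between an agent and its tools. Every name in it (ToolCall, Context, Policy, guard, the toy customer store, the scope strings) is hypothetical and chosen for illustration; it is not any particular product's API. The point to notice is that the policy function receives only the call and the caller's context, never the prompt or the model's stated reasoning.

```typescript
// Hypothetical names throughout; an illustrative sketch only.
type ToolCall = { tool: string; args: Record<string, unknown> };
type Context = { agentId: string; scopes: string[] };
type Policy = (call: ToolCall, ctx: Context) => boolean;
type Tool = (args: Record<string, unknown>) => string;
type Result = { ok: true; value: string } | { ok: false; reason: string };

// Wrap the tool table so every call is checked against the policy before
// the tool runs. The model's prompt and reasoning are not parameters here.
function guard(tools: Record<string, Tool>, policy: Policy) {
  return (call: ToolCall, ctx: Context): Result => {
    if (!policy(call, ctx)) {
      return { ok: false, reason: `policy denied ${call.tool} for ${ctx.agentId}` };
    }
    const tool = tools[call.tool];
    if (tool === undefined) {
      return { ok: false, reason: `unknown tool ${call.tool}` };
    }
    return { ok: true, value: tool(call.args) };
  };
}

// A toy customer store and two tools over it.
const customers: Record<string, string> = { c1: "Ada" };
const tools: Record<string, Tool> = {
  read_customer: (args) => customers[String(args.id)] ?? "not found",
  delete_customer: (args) => {
    delete customers[String(args.id)];
    return "deleted";
  },
};

// Reads need a read scope, deletes need a write scope, anything else is denied.
const policy: Policy = (call, ctx) => {
  if (call.tool === "read_customer") return ctx.scopes.includes("customers:read");
  if (call.tool === "delete_customer") return ctx.scopes.includes("customers:write");
  return false;
};

const callTool = guard(tools, policy);
const supportAgent: Context = { agentId: "support-bot", scopes: ["customers:read"] };

// Suppose an injected instruction persuaded the model to emit this call.
const outcome = callTool({ tool: "delete_customer", args: { id: "c1" } }, supportAgent);
console.log(outcome.ok, "c1" in customers); // false true
```

In this sketch the injection can change which call the model emits, but the record is still there afterward because the read-only agent never had the scope the delete requires.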
An agent persuaded to wire money still has to clear a rule that says “this agent, in this context, may not move funds over a threshold without a human.” The injection succeeds at changing the model’s intent and still produces a denied request and an unchanged ledger.
```typescript
// The agent's *reason* never enters the decision. Only the action + context do.
allow("payments.transfer", (action, ctx) => {
  if (action.amount > 10_000) return requireHuman(ctx.owner);
  return ctx.agent.scopes.includes("payments:write");
});
```
What changes when you do this
Reframing injection as an authorization failure rather than a prompt failure changes what you build:
- You scope agents like you scope service accounts — least privilege, per action, not “the model knows better.”
- You evaluate policy in the request path and fail closed, so an undecidable request is a denied request.
- You get an audit trail of decisions, not just traces of behavior, because every action was adjudicated.
- You stop shipping fixes to the prompt and start shipping fixes to the policy — which is testable.
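The points above can be sketched together in a few lines. What follows is a minimal illustration, not a definitive implementation: decide, rules, auditLog, the Decision values, and the agent and scope names are all hypothetical, and the transfer rule is only loosely modeled on the earlier snippet (it checks scope before the threshold, which is an assumption of this sketch). It shows one rule per action, a deny default when no rule exists or a rule throws, and an audit entry for every decision.

```typescript
// Hypothetical names throughout; an illustrative sketch only.
type Action = { name: string; amount?: number };
type Ctx = { agentId: string; scopes: string[] };
type Decision = "allow" | "deny" | "require_human";
type Rule = (action: Action, ctx: Ctx) => Decision;
type AuditEntry = { agentId: string; action: string; decision: Decision; why: string };

const auditLog: AuditEntry[] = [];

// Least privilege: one rule per action. An action with no rule gets no access.
const rules: Record<string, Rule> = {
  "payments.transfer": (action, ctx) => {
    if (typeof action.amount !== "number") throw new Error("amount missing");
    if (!ctx.scopes.includes("payments:write")) return "deny";
    return action.amount > 10_000 ? "require_human" : "allow";
  },
};

function decide(action: Action, ctx: Ctx): Decision {
  // Fail closed: start from deny and only move off it if a rule says so.
  let decision: Decision = "deny";
  let why = "no rule for action";
  if (Object.prototype.hasOwnProperty.call(rules, action.name)) {
    try {
      decision = rules[action.name](action, ctx);
      why = "rule evaluated";
    } catch (err) {
      // A rule that cannot decide is treated as a denial.
      decision = "deny";
      why = "rule error: " + (err instanceof Error ? err.message : String(err));
    }
  }
  // Every request is adjudicated and recorded, whatever the outcome.
  auditLog.push({ agentId: ctx.agentId, action: action.name, decision, why });
  return decision;
}

// Because the policy is plain code, it can be exercised like any other unit.
const billingBot: Ctx = { agentId: "billing-bot", scopes: ["payments:write"] };
const results: Decision[] = [
  decide({ name: "payments.transfer", amount: 500 }, billingBot), // under threshold
  decide({ name: "payments.transfer", amount: 50_000 }, billingBot), // over threshold
  decide({ name: "payments.transfer" }, billingBot), // rule throws
  decide({ name: "ledger.purge" }, billingBot), // no rule registered
];
console.log(results.join(","), auditLog.length); // allow,require_human,deny,deny 4
```

The last four calls double as a regression test for the policy: if someone loosens the rule, the expected decisions change and the check fails, which is harder to arrange for a prompt edit.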
The uncomfortable upside
This is less exciting than a jailbreak-proof prompt, and that’s the point. Authorization is a solved discipline — we’ve spent decades enforcing it for human users and service accounts. Agents are just a new kind of principal. Once you stop asking the model to police itself and start enforcing what each agent is allowed to do, prompt injection stops being an existential threat and becomes what it always was: an attempt to take an action the caller isn’t authorized to take.
That’s the layer we build at VisIQ Labs. If you’re wrestling with agent authorization, the documentation is the place to start.