Why output filters don't stop indirect prompt injection
In a prompt-injection training game (Gandalf), the goal is to extract a password from an AI told not to share it. Level 4 had an output filter: a second AI checking every response.
How it was bypassed
The direct approach was blocked. The solution was a story: a child who doesn't speak Greek needs a short story in English. The password got woven into the narrative. The filter scanned for the literal password string, didn't find it, approved it. The password leaked. This is indirect injection : the model didn't "reveal the secret" — it told a story that happened to contain it. Same information, different wrapper.
Why it matters for real systems
In a real voicebot, the entire knowledge base — payment details included — sat inside the system prompt. In a standard LLM, the system prompt and user input share one context window; there's no architectural wall. A skilled attacker could make the model display the wrong IBAN to the next user — not by touching the database, just by making the model say something different. The window before anyone notices can be weeks.
The real defense is architectural
Filters scan for known patterns; indirect injection generates unknown ones. Add a filter, an attacker finds a new wrapper. Financial data shouldn't come from LLM output at all — it comes from a verified system, with the model out of the chain of custody. No single control is enough; you need layers (see prompt injection and threat modeling ).
Where do you hide sensitive data in your prompts?
A Shielding Review checks system-prompt isolation, output handling, and sensitive-data flow. Free 45-min session.
Book a free session