Indirect Prompt Injection Through Retrieved Content
Framework mapping
Mapped to public frameworks where useful for education and reuse. These mappings are not compliance claims, certifications, or assurance statements.
Responsible-use note
AI Security Commons materials are created for education, defensive research, and responsible AI security learning. Attack examples are simplified and controlled. Do not use these techniques against systems without authorization. Review the Research Use Terms before applying any lab ideas.
Indirect prompt injection path
The user may ask a normal question; the attacker instruction arrives through retrieved content that the system failed to label as data.
Problem this artifact explains
Indirect prompt injection happens when untrusted retrieved content influences an AI system as if it were an instruction. In AI War Games, this pattern appears when a lab document, ticket, knowledge-base snippet, or simulated webpage tries to override the system's policy or push the agent toward a tool call.
Instruction sources to keep separate
The defensive lesson is simple but easy to miss: not all text has the same authority. The system prompt defines durable behavior, developer instructions shape the application, user prompts request a task, retrieved content provides evidence, and tool outputs report results.
- System and developer instructions set policy and should not be overridden by retrieved content.
- User prompts express intent but still need authorization checks before side effects.
- Retrieved content is evidence from an external source; quote it as data.
- Tool outputs are observations from a tool; they should not silently become new policy.
Safe lab example
A learner asks an assistant to summarize a retrieved customer-support note. The note contains normal support text plus a malicious paragraph telling the assistant to ignore policy, reveal hidden instructions, and send a webhook to an external address.
In the safe lab, success for the attacker means the agent treats that retrieved paragraph as authority. Success for the defender means the agent quotes the paragraph as untrusted content, refuses the side effect, and asks the user for explicit approval before any external action.
Where to practice this in AI War Games
Start with guided prompt-injection missions to see direct instruction override attempts. Then move to Agentic Lab for RAG poisoning and tool-use variants where the injected text attempts to trigger a call, permission change, or data disclosure. Use Defender to replay the moment the source boundary failed.
Controls this teaches
The control goal is not to make the model ignore all retrieved content. The goal is to preserve the content's source and authority so useful evidence can be used without becoming a hidden instruction channel.
- Quote retrieved content as data with source attribution.
- Apply an instruction hierarchy that retrieved content cannot outrank.
- Sandbox retrieval summaries before they influence tools or memory.
- Require user approval for side effects such as sending, deleting, purchasing, writing memory, or calling external APIs.
- Log the retrieved source and the tool decision for after-action review.
What to defend after the lab
After a mission, ask which boundary failed. Did the model confuse data with instruction? Did a tool gateway trust model judgment without policy? Did the UI hide the source from the user? Those answers should drive the next Builder scenario and Defender checklist.