Attack Pattern

Indirect Prompt Injection Through Retrieved Content

Date: May 18, 2026Status: UpdatedLicense: CC BY 4.0Read time: 8 min

Summary: A concrete attack-pattern note for retrieved text that attempts to override policy, misuse tools, or redirect agent behavior.

Framework mapping

Mapped to public frameworks where useful for education and reuse. These mappings are not compliance claims, certifications, or assurance statements.

OWASP LLM01 Prompt InjectionOWASP LLM06 Excessive AgencyMITRE ATLAS

Responsible-use note

AI Security Commons materials are created for education, defensive research, and responsible AI security learning. Attack examples are simplified and controlled. Do not use these techniques against systems without authorization. Review the Research Use Terms before applying any lab ideas.

Indirect prompt injection path

STEP 1

Attacker-controlled content is indexed or retrieved

STEP 2

Agent reads content during a normal task

STEP 3

Injected text is mistaken for instruction

STEP 4

Agent reveals data or calls a tool

STEP 5

Defender reviews source and control gap

The user may ask a normal question; the attacker instruction arrives through retrieved content that the system failed to label as data.

Problem this artifact explains

Indirect prompt injection happens when untrusted retrieved content influences an AI system as if it were an instruction. In AI War Games, this pattern appears when a lab document, ticket, knowledge-base snippet, or simulated webpage tries to override the system's policy or push the agent toward a tool call.

Instruction sources to keep separate

The defensive lesson is simple but easy to miss: not all text has the same authority. The system prompt defines durable behavior, developer instructions shape the application, user prompts request a task, retrieved content provides evidence, and tool outputs report results.

System and developer instructions set policy and should not be overridden by retrieved content.
User prompts express intent but still need authorization checks before side effects.
Retrieved content is evidence from an external source; quote it as data.
Tool outputs are observations from a tool; they should not silently become new policy.

Safe lab example

A learner asks an assistant to summarize a retrieved customer-support note. The note contains normal support text plus a malicious paragraph telling the assistant to ignore policy, reveal hidden instructions, and send a webhook to an external address.

In the safe lab, success for the attacker means the agent treats that retrieved paragraph as authority. Success for the defender means the agent quotes the paragraph as untrusted content, refuses the side effect, and asks the user for explicit approval before any external action.

Where to practice this in AI War Games

Start with guided prompt-injection missions to see direct instruction override attempts. Then move to Agentic Lab for RAG poisoning and tool-use variants where the injected text attempts to trigger a call, permission change, or data disclosure. Use Defender to replay the moment the source boundary failed.

Controls this teaches

The control goal is not to make the model ignore all retrieved content. The goal is to preserve the content's source and authority so useful evidence can be used without becoming a hidden instruction channel.

Quote retrieved content as data with source attribution.
Apply an instruction hierarchy that retrieved content cannot outrank.
Sandbox retrieval summaries before they influence tools or memory.
Require user approval for side effects such as sending, deleting, purchasing, writing memory, or calling external APIs.
Log the retrieved source and the tool decision for after-action review.

What to defend after the lab

After a mission, ask which boundary failed. Did the model confuse data with instruction? Did a tool gateway trust model judgment without policy? Did the UI hide the source from the user? Those answers should drive the next Builder scenario and Defender checklist.