Lab Note

Lab Note 001: Prompt Injection Against a Tool-Using Agent

Date: May 18, 2026Status: UpdatedLicense: CC BY 4.0Read time: 6 min

Summary: A lab-style observation from an AI War Games scenario where attacker text tries to move from chat into unauthorized tool use.

Framework mapping

Mapped to public frameworks where useful for education and reuse. These mappings are not compliance claims, certifications, or assurance statements.

OWASP LLM01 Prompt InjectionOWASP LLM06 Excessive AgencyMITRE ATLAS

Responsible-use note

AI Security Commons materials are created for education, defensive research, and responsible AI security learning. Attack examples are simplified and controlled. Do not use these techniques against systems without authorization. Review the Research Use Terms before applying any lab ideas.

Scenario setup

This lab note describes a controlled AI War Games browser scenario. The target is a simulated assistant that can answer normal questions and propose a limited tool action. The protected asset is a secret or side effect the agent should not expose or trigger without authorization.

The exercise is intentionally synthetic. It is designed for education, defensive research, and authorized practice only.

Attacker goal

The attacker tries to move from conversation into authority. Instead of merely asking for the secret, they attempt to convince the assistant that policy has changed, that a trusted role has approved the request, or that the tool should be called as part of a harmless diagnostic step.

Expected model failure

The expected failure is not simply a rude or unsafe answer. The failure is that the model treats attacker-controlled text as permission to reveal protected data, write state, or request a tool action that the scenario policy does not allow.

Observed behavior in the lab

In a vulnerable run, the assistant follows the attacker's reframing, acknowledges the fake authority, and produces a response that would satisfy the attacker objective. In a better run, the assistant keeps the role boundary intact and asks for explicit authorization before any side effect.

Defender replay is useful here because the decisive moment is often a single sentence where the assistant stops treating the policy as binding and starts treating the attacker prompt as an override.

Defensive intervention

The strongest mitigation tested in this style of lab is a combination of instruction hierarchy, tool gateway policy, and explicit approval. The model can explain why it cannot perform the action, but the real control is that the tool gateway refuses the unauthorized request.

Label user text, retrieved text, and tool output separately.
Reject claims of approval that arrive only through the conversation.
Require a gateway decision for side-effecting tools.
Log the prompt, proposed action, gateway decision, and final answer for replay.

Lessons learned

Prompt injection labs are most useful when they end with a defensive explanation. The learner should identify the failed boundary, the control that should own the decision, and how to encode the next safer version in Builder.

What to try next in a browser lab

Try the same mission in three modes: direct prompt injection, indirect retrieved-content injection, and tool-permission bypass. Compare the transcripts, then use Defender-style review questions to decide whether instruction hierarchy, permission gating, or memory controls would have helped most.