Agentic AI Security Reference Architecture v0.2
Framework mapping
Mapped to public frameworks where useful for education and reuse. These mappings are not compliance claims, certifications, or assurance statements.
Responsible-use note
AI Security Commons materials are created for education, defensive research, and responsible AI security learning. Attack examples are simplified and controlled. Do not use these techniques against systems without authorization. Review the Research Use Terms before applying any lab ideas.
Agentic Lab trust boundaries
The agent may propose actions, but authority lives in gateways, approvals, and replayable records outside the prompt.
Problem this artifact explains
Agentic Lab missions expose a common product risk: once a model can call tools, query retrieval, write memory, or ask for approval, prompt injection becomes a control-plane problem rather than only a chat problem.
This reference architecture documents the boundaries AI War Games uses to explain tool abuse, privilege escalation, data exfiltration, RAG poisoning, and multi-step agent workflows in a safe research preview.
Where this appears in AI War Games
The architecture is visible across the product loop. Learn pages introduce the mental model, guided missions demonstrate simpler failures, Agentic Lab combines those failures into multi-step tool-use attacks, Builder lets authenticated users define protected scenarios, and Defender replays the evidence after a breach.
- Learn: explain why instructions, retrieved content, memory, and tool outputs need different trust labels.
- Attack: use missions and Agentic Lab to test whether the model treats untrusted data as authority.
- Defend: replay failures and map them back to a missing approval, permission, retrieval, or memory control.
- Build / Protect: turn this architecture into a checklist before publishing a custom lab.
- Research: publish the observed pattern as a Commons note rather than leaving it as an isolated play-through.
Tool-use trust boundaries
The model should not directly own tool authority. Tool calls should pass through a gateway that checks identity, tool class, parameters, rate limits, and side effects before execution.
- Read-only lookup can often be allowed with logging and source display.
- Draft-only actions should create a reviewable artifact rather than send or mutate state.
- Side-effecting actions such as send, delete, purchase, webhook, database update, or permission change should require explicit policy and approval checks.
Agent memory boundaries
Memory is a state-changing tool. A poisoned memory can affect later sessions even after the original attacker is gone, so memory writes need source attribution, scope, reversibility, and review.
- Separate harmless preferences from identity, authorization, safety, and policy claims.
- Mark memory written from untrusted content as untrusted until reviewed.
- Record who or what source caused the memory write so Defender can reconstruct the path.
Retrieval and content boundaries
Retrieved documents, tickets, webpages, and email should be treated as data even when they contain imperative language. The agent can summarize them, but they should not outrank system instructions or approve tool calls.
- Quote retrieved content as evidence, not as a new instruction layer.
- Display source attribution to the user before high-impact decisions.
- Run risky retrieval content through a sandboxed interpretation path before tool use.
Human approval gates
Approval should be explicit, fresh, contextual, and separate from the text the model is trying to interpret. A sentence inside a retrieved document saying a manager approved this is not the same as approval from the current authorized user.
- Show the exact action, target, source evidence, and expected side effect.
- Require approval for irreversible, external, financial, privacy, or permission-changing operations.
- Log the approval decision with enough context to replay it later.
Audit and replay requirements
Defender needs more than a chat transcript. A useful replay shows user messages, retrieved sources, memory reads/writes, proposed tool calls, gateway decisions, approvals, and final side effects.
The goal is not to claim perfect assurance. The goal is to explain what failed, why a control would have helped, and how to build the next safer lab or product design.