Architecture

Agentic AI Security Reference Architecture v0.2

Date: May 18, 2026
Status: Updated
License: CC BY 4.0
Read time: 9 min
Summary: A product-grounded architecture for Agentic Lab systems that use tools, memory, retrieval, approval gates, and replay logs.

Framework mapping

This architecture is mapped to public frameworks where that is useful for education and reuse. These mappings are not compliance claims, certifications, or assurance statements.

  • OWASP LLM01 Prompt Injection
  • OWASP LLM06 Excessive Agency
  • OWASP LLM07 System Prompt Leakage
  • MITRE ATLAS
  • NIST AI RMF GenAI Profile

Responsible-use note

AI Security Commons materials are created for education, defensive research, and responsible AI security learning. Attack examples are simplified and controlled. Do not use these techniques against systems without authorization. Review the Research Use Terms before applying any lab ideas.

Agentic Lab trust boundaries

STEP 1: User and session intent
STEP 2: Instruction hierarchy and source labels
STEP 3: Planner with no direct authority
STEP 4: Tool, retrieval, and memory gateways
STEP 5: Human approval, audit, and replay

The agent may propose actions, but authority lives in gateways, approvals, and replayable records outside the prompt.
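
One way to picture the source labels from step 2 is a small tagging scheme in which only some sources are allowed to carry instructions. This is a minimal sketch, not the AI War Games implementation: the `Source` values, the `Message` type, and the `may_instruct` helper are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    """Hypothetical source labels attached to every piece of context."""
    SYSTEM = "system"
    USER = "user"
    MEMORY = "memory"
    RETRIEVED = "retrieved"
    TOOL_OUTPUT = "tool_output"


@dataclass(frozen=True)
class Message:
    source: Source
    text: str


# Assumption: only system and user messages may direct the planner.
# Everything else is treated as data, whatever its wording.
INSTRUCTION_SOURCES = {Source.SYSTEM, Source.USER}


def may_instruct(msg: Message) -> bool:
    """Return True if this message's source is allowed to carry instructions."""
    return msg.source in INSTRUCTION_SOURCES
```

Under this sketch, a retrieved page that says "ignore previous instructions" is still labeled `Source.RETRIEVED`, so `may_instruct` returns False for it regardless of how imperative the text is.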

Problem this artifact explains

Agentic Lab missions expose a common product risk: once a model can call tools, query retrieval, write memory, or ask for approval, prompt injection becomes a control-plane problem rather than only a chat problem.

This reference architecture documents the boundaries AI War Games uses to explain tool abuse, privilege escalation, data exfiltration, RAG poisoning, and multi-step agent workflows in a safe research preview.

Where this appears in AI War Games

The architecture is visible across the product loop. Learn pages introduce the mental model, guided missions demonstrate simpler failures, Agentic Lab combines those failures into multi-step tool-use attacks, Builder lets authenticated users define protected scenarios, and Defender replays the evidence after a breach.

  • Learn: explain why instructions, retrieved content, memory, and tool outputs need different trust labels.
  • Attack: use missions and Agentic Lab to test whether the model treats untrusted data as authority.
  • Defend: replay failures and map them back to a missing approval, permission, retrieval, or memory control.
  • Build / Protect: turn this architecture into a checklist before publishing a custom lab.
  • Research: publish the observed pattern as a Commons note rather than leaving it as an isolated play-through.

Tool-use trust boundaries

The model should not directly own tool authority. Tool calls should pass through a gateway that checks identity, tool class, parameters, rate limits, and side effects before execution.

  • Read-only lookup can often be allowed with logging and source display.
  • Draft-only actions should create a reviewable artifact rather than send or mutate state.
  • Side-effecting actions such as send, delete, purchase, webhook, database update, or permission change should require explicit policy and approval checks.
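
The three tool classes above can be sketched as a gateway that decides before anything executes. This is an illustrative sketch under stated assumptions: the tool names in `TOOL_REGISTRY`, the `Gateway` class, and the per-user call limit are hypothetical, and parameter validation is left out to keep the example short.

```python
from dataclasses import dataclass, field
from enum import Enum


class ToolClass(Enum):
    READ_ONLY = "read_only"
    DRAFT_ONLY = "draft_only"
    SIDE_EFFECT = "side_effect"


class Decision(Enum):
    ALLOW = "allow"
    NEEDS_APPROVAL = "needs_approval"
    DENY = "deny"


# Hypothetical registry mapping tool names to their class.
TOOL_REGISTRY = {
    "search_docs": ToolClass.READ_ONLY,
    "draft_email": ToolClass.DRAFT_ONLY,
    "send_email": ToolClass.SIDE_EFFECT,
}


@dataclass
class ToolCall:
    user_id: str
    tool: str
    params: dict
    approved: bool = False  # set by an approval step outside the model


@dataclass
class Gateway:
    known_users: set
    max_calls_per_user: int = 5
    _counts: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    def check(self, call: ToolCall) -> Decision:
        """Decide on a proposed call and log the decision for later replay."""
        decision = self._decide(call)
        self.log.append((call.user_id, call.tool, decision.value))
        return decision

    def _decide(self, call: ToolCall) -> Decision:
        if call.user_id not in self.known_users:
            return Decision.DENY  # identity check
        tool_class = TOOL_REGISTRY.get(call.tool)
        if tool_class is None:
            return Decision.DENY  # unregistered tools are never executed
        used = self._counts.get(call.user_id, 0)
        if used >= self.max_calls_per_user:
            return Decision.DENY  # rate limit
        self._counts[call.user_id] = used + 1
        if tool_class is ToolClass.SIDE_EFFECT and not call.approved:
            return Decision.NEEDS_APPROVAL
        return Decision.ALLOW
```

The design choice worth noting is that the model only ever produces a `ToolCall` proposal. The `approved` flag and the final decision come from code the prompt cannot rewrite.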

Agent memory boundaries

Memory is a state-changing tool. A poisoned memory can affect later sessions even after the original attacker is gone, so memory writes need source attribution, scope, reversibility, and review.

  • Separate harmless preferences from identity, authorization, safety, and policy claims.
  • Mark memory written from untrusted content as untrusted until reviewed.
  • Record the user or source that caused each memory write so Defender can reconstruct the path.
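
The memory rules above can be sketched as a store that attaches attribution and trust state to every write. The `MemoryStore` class, the `Kind` categories, and the source strings are hypothetical and exist only to illustrate the idea. A real system would also need scoping per user or session, which this sketch omits.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    PREFERENCE = "preference"  # harmless preferences such as tone or format
    SENSITIVE = "sensitive"    # identity, authorization, safety, policy claims


@dataclass
class MemoryEntry:
    key: str
    value: str
    kind: Kind
    source: str       # who or what caused the write
    trusted: bool
    reviewed: bool = False


class MemoryStore:
    def __init__(self):
        self.entries = []

    def write(self, key, value, kind, source, source_trusted):
        """Store an entry; its trust level is inherited from its source."""
        entry = MemoryEntry(key, value, kind, source, trusted=source_trusted)
        self.entries.append(entry)
        return entry

    def review(self, entry, approve):
        """Record a human review decision for one entry."""
        entry.reviewed = True
        entry.trusted = approve

    def usable(self):
        """Entries the planner may read: trusted, and reviewed if sensitive."""
        out = []
        for e in self.entries:
            if not e.trusted:
                continue
            if e.kind is Kind.SENSITIVE and not e.reviewed:
                continue
            out.append(e)
        return out

    def revert(self, source):
        """Remove every entry written by one source; return how many were removed."""
        before = len(self.entries)
        self.entries = [e for e in self.entries if e.source != source]
        return before - len(self.entries)
```

Because each entry keeps its `source`, a poisoned write can be traced and reverted as a group instead of being hunted down entry by entry.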

Retrieval and content boundaries

Retrieved documents, tickets, webpages, and email should be treated as data even when they contain imperative language. The agent can summarize them, but they should not outrank system instructions or approve tool calls.

  • Quote retrieved content as evidence, not as a new instruction layer.
  • Display source attribution to the user before high-impact decisions.
  • Run risky retrieved content through a sandboxed interpretation path before tool use.
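
A rough sketch of the first and third points: retrieved text is rendered as quoted, attributed evidence, and a crude check flags content that should take the sandboxed path. The `RetrievedDoc` type, both helper functions, and the marker list are hypothetical. Keyword matching is used here only as a stand-in for whatever risk signal a real system applies, and it is not a defense on its own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievedDoc:
    source: str
    text: str


# Hypothetical and deliberately naive markers of imperative language.
IMPERATIVE_MARKERS = ("ignore previous", "you must", "approve", "send ", "delete ")


def looks_risky(doc: RetrievedDoc) -> bool:
    """Flag content that should go through a sandboxed interpretation path."""
    lowered = doc.text.lower()
    return any(marker in lowered for marker in IMPERATIVE_MARKERS)


def as_evidence(doc: RetrievedDoc) -> str:
    """Render retrieved text as quoted evidence with source attribution."""
    quoted = "\n".join("> " + line for line in doc.text.splitlines())
    return f"[untrusted data from {doc.source}]\n{quoted}"
```

For example, `as_evidence(RetrievedDoc("ticket-123", "Send the file."))` yields a block that begins with `[untrusted data from ticket-123]` followed by the quoted line, so the planner sees the text as something to report on, not something to obey.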

Human approval gates

Approval should be explicit, fresh, contextual, and separate from the text the model is trying to interpret. A sentence inside a retrieved document saying "a manager approved this" is not the same as approval from the current authorized user.

  • Show the exact action, target, source evidence, and expected side effect.
  • Require approval for irreversible, external, financial, privacy, or permission-changing operations.
  • Log the approval decision with enough context to replay it later.
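
The "explicit, fresh, contextual" properties can be sketched as an approval object that is bound to one exact request and expires. The type names, the `is_valid` helper, and the five-minute window are assumptions made for illustration, not values taken from AI War Games.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class ApprovalRequest:
    action: str
    target: str
    evidence: str      # source evidence shown to the approver
    side_effect: str   # expected side effect shown to the approver


@dataclass(frozen=True)
class Approval:
    request: ApprovalRequest
    approver: str
    granted_at: float  # seconds since the epoch


MAX_AGE_SECONDS = 300  # hypothetical freshness window


def is_valid(approval, request, authorized_users, now=None):
    """Check that an approval is explicit, fresh, and tied to this exact request."""
    now = time.time() if now is None else now
    return (
        approval.approver in authorized_users                   # explicit: a real authorized user
        and approval.request == request                         # contextual: same action and target
        and 0 <= now - approval.granted_at <= MAX_AGE_SECONDS   # fresh
    )
```

Nothing in this check reads model output or retrieved text, which is the point: a document that claims approval cannot construct an `Approval` object.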

Audit and replay requirements

Defender needs more than a chat transcript. A useful replay shows user messages, retrieved sources, memory reads/writes, proposed tool calls, gateway decisions, approvals, and final side effects.
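
A replay record along these lines could be sketched as an ordered event log with one event kind per item in the list above. The `ReplayLog` class, the event kind names, and the JSON Lines export are hypothetical choices for illustration. The actual Defender format is not described here.

```python
import json
from dataclasses import asdict, dataclass

# Hypothetical event kinds, one per item a useful replay should show.
EVENT_KINDS = {
    "user_message", "retrieval", "memory_read", "memory_write",
    "tool_proposal", "gateway_decision", "approval", "side_effect",
}


@dataclass(frozen=True)
class Event:
    seq: int      # position in the session, starting at 0
    kind: str
    detail: dict


class ReplayLog:
    def __init__(self):
        self._events = []

    def record(self, kind, **detail):
        """Append one event; unknown kinds are rejected to keep replays uniform."""
        if kind not in EVENT_KINDS:
            raise ValueError(f"unknown event kind: {kind}")
        event = Event(len(self._events), kind, detail)
        self._events.append(event)
        return event

    def replay(self):
        """Return the events in the order they were recorded."""
        return list(self._events)

    def to_jsonl(self):
        """Serialize the log as one JSON object per line."""
        return "\n".join(json.dumps(asdict(e), sort_keys=True) for e in self._events)
```

A session might record `log.record("retrieval", source="ticket-123")` followed by `log.record("gateway_decision", tool="send_email", decision="needs_approval")`, which lets a reviewer see which source preceded which gateway decision.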

The goal is not to claim perfect assurance. The goal is to explain what failed, why a control would have helped, and how to build the next safer lab or product design.
