AI Security Research: Trust Boundary Collapse in Agentic AI

1. Introduction: The Illusion of Isolation

Decades of cybersecurity engineering have been dedicated to building walls. We separate the control plane from the data plane. We use parameterized queries to prevent SQL injections, ensuring the database treats user input strictly as data, never as executable SQL commands.

With the explosive integration of Agentic AI, the industry has eagerly connected natural language interfaces to highly privileged backend APIs. However, developers mistakenly assume that because they labeled a section of their prompt as System: and another as User:, the underlying model understands the boundary between the two.

It does not.

As highlighted by Microsoft Security Response Center’s urgent May 2026 publication (“Prompts become shells: RCE vulnerabilities in AI agent frameworks”), modern orchestration frameworks are collapsing under their own weight. When frameworks pass user input, system instructions, RAG-retrieved documents, and tool definitions into the same context window, they are effectively placing untrusted internet data into a high-privileged execution terminal. The result is systemic prompt privilege escalation.

2. The Linguistic Von Neumann Architecture

To understand the mechanics of a Trust Boundary Collapse, we must look at how LLMs process information through the lens of the Semantic Execution Layer.

In classic Von Neumann computing architectures, instructions (code) and data share the same physical memory space. In the 1990s and early 2000s, this shared space led to the golden age of Buffer Overflows. If an attacker fed too much data into a variable, the data overflowed into the instruction space, and the CPU blindly executed it. The industry solved this by implementing physical hardware protections like the NX (No-eXecute) bit, mathematically preventing data from being interpreted as code.

Agentic AI is a Linguistic Von Neumann Machine.

Inside the context window of an LLM, there is no “NX bit” for text. There is no cryptographic signature distinguishing the system prompt from the user prompt. To the model’s self-attention mechanism, the developer’s rigid instructions, the JSON schema of a tool, and a malicious payload hidden in a PDF are all just tokens.

The Conflation of Primitives

In an LLM context window, the traditional computing primitives completely collapse:

Code (Logic): The developer’s System Prompt (“You are a helpful assistant…”).
Data: The user’s input (“Summarize this text.”).
Configuration: The Tool Manifests (“Available functions: [read_db, send_email]”).
Documentation: The RAG-retrieved context chunks.

Because the LLM is an untyped probabilistic execution environment, it continuously predicts the next logical token based on the entire semantic weight of the context window. If the untrusted “Data” contains semantic patterns that strongly mimic “Code” or “Configuration,” the model will seamlessly shift its behavior, executing the data as if it were a system instruction. This is known as Instruction/Data Conflation.

3. Contextual Authority Confusion & Prompt Privilege Escalation

When trust boundaries collapse, the immediate consequence is Contextual Authority Confusion.

In an orchestration framework, different components have different levels of authority. The System Prompt has high authority; a web-scraped document has low authority. However, because the LLM only understands mathematical attention weights (not hardcoded RBAC roles), an attacker can use semantic manipulation to artificially inflate the authority of their payload.

The Mechanics of Privilege Escalation

Recent 2026 academic reviews (e.g., MDPI Information 17/1/54 and related arXiv studies on prompt hijacking) demonstrate how attackers achieve Prompt Privilege Escalation within the collapsed boundary:

1. Persona Hijacking (Authority Spoofing)

The attacker embeds phrases mimicking high-authority system components within the untrusted data. Strings like [SYSTEM OVERRIDE], <|im_start|>system, or ERROR: DEBUG MODE REQUIRED exploit the model’s training data. Because the model was fine-tuned to obey system-like formats, it elevates the privilege of the user’s data to that of a system instruction.

2. Semantic Ambiguity Exploitation

Human language is inherently ambiguous. Attackers craft payloads that are semantically closer to the model’s tool descriptions than the benign user request. In the absence of strict boundaries, the model resolves the semantic ambiguity by executing the attacker’s payload, perceiving it as the most “statistically probable” interpretation of the context window.

4. The Manifestation of Collapse in Modern Exploits

By recognizing the Trust Boundary Collapse as the root cause, we can demystify the entire taxonomy of modern Agentic AI attacks. Every advanced technique we have documented in the Hermes Codex is simply a different manifestation of Instruction/Data Conflation.

RAG Poisoning (Data becoming Instructions): When an enterprise RAG pipeline retrieves a poisoned PDF from SharePoint, the orchestration framework injects the PDF’s text directly into the LLM’s context window to ground the answer. Because the boundary is collapsed, the malicious payload inside the PDF ceases to be passive data; it is elevated to active context, fundamentally hijacking the agent’s logic flow.
Tool Poisoning (Documentation becoming Code): As discussed in our Semantic Supply Chain analysis, LLMs rely on JSON schema descriptions to understand how to use a tool. An attacker poisoning a third-party Model Context Protocol (MCP) registry alters the tool’s description. The LLM absorbs this metadata as an authoritative system instruction, allowing external documentation to dictate internal execution.
Agent-to-Agent Lateral Movement (Outputs becoming Commands): In multi-agent swarms, Agent A processes untrusted internet data and summarizes it for Agent B (a highly privileged admin agent). Because Agent B inherently trusts the output of Agent A, the collapsed trust boundary is effectively transmitted across the network, leading to systemic compromise.

5. Why Cloud Security Fails to Protect AI Agents

The Trust Boundary Collapse does not just break the LLM; it breaks the traditional Cloud Security and Identity and Access Management (IAM) models surrounding it.

As emphatically stated in the Microsoft Security blog (“Prompts become shells”, May 2026), attaching a highly privileged IAM role (e.g., an AWS IAM Role or an Azure Managed Identity) to an AI Agent’s container relies on a dangerous assumption. Cloud security assumes that the container runs deterministic, compiled code (like a Node.js or Python backend) where the execution path is strictly controlled by the developers.

When the container hosts an LLM equipped with tools (like aws_s3_read or execute_bash), the execution path is no longer controlled by the code; it is controlled by the prompt.

Because of the collapsed boundary, any user (or any poisoned data source) that influences the LLM’s context window effectively inherits the IAM role attached to the container. Traditional Web Application Firewalls (WAF) and Cloud Security Posture Management (CSPM) tools cannot inspect the semantic intent of a natural language prompt, rendering them entirely blind to the exploitation of the agent.

6. Re-establishing the Boundary (Architectural Mitigations)

We cannot “patch” a Transformer model to definitively separate instructions from data; the architecture is inherently flat. Therefore, Security Architects must engineer a “Harvard Architecture” for AI at the orchestration layer, forcing the separation of execution streams.

A. The “Dual-LLM” Pattern (Semantic Isolation)

To rebuild the trust boundary, organizations must physically separate the processing of untrusted data from the execution of privileged tools.

The Parser LLM: This model is strictly sandboxed. It has no tools and no network egress capabilities. Its only job is to ingest untrusted user prompts and RAG data, sanitize them, and extract structured variables (e.g., strict JSON).
The Planner LLM: This model holds the highly privileged tools. It never interacts with raw user input. It only accepts the strictly typed, deterministic JSON output provided by the Parser LLM.

B. Structured Generation and Execution Constraints

Relying on the LLM to output free-form text that is then parsed into tool arguments is a recipe for disaster. Frameworks must enforce Structured Generation (e.g., using Outlines, Instructor, or native JSON mode APIs combined with strictly evaluated schemas). If the probabilistic interpreter attempts to deviate from the hardcoded mathematical structure, the execution pipeline must fail-close immediately.

C. Shifting to Least Privilege Capability Security

Because the cognitive boundary is inherently unstable, the operational boundary must be absolute. Agents must never possess ambient authority. Every tool invocation must require Just-In-Time (JIT) ephemeral tokens and out-of-band Human-in-the-Loop (HITL) approval for state-altering actions, ensuring that a semantic hijack cannot lead to kinetic damage.

7. Conclusion: The Root Cause of AI Vulnerability

The integration of Large Language Models into enterprise workflows has generated an explosion of innovative capabilities, but it has done so by violating the most sacred principle of computer science: the separation of code and data.

The Trust Boundary Collapse is not a bug that can be fixed with more RLHF training, better system prompts, or simple keyword filtering. It is an architectural reality of deploying linguistic Von Neumann machines. Every major vulnerability in the Agentic AI ecosystem—from Tool Injection to MCP Server compromise—is a downstream symptom of this single, foundational flaw.

Until the cybersecurity industry stops treating Prompt Injections as mere “hallucinations” or “content policy violations,” and starts treating them as Contextual Authority Confusions that bypass enterprise IAM controls, Agentic AI will remain a massive, unmitigated risk to corporate infrastructure.

Sources & References

Microsoft Security Blog (May 2026): Prompts become shells: RCE vulnerabilities in AI agent frameworks
MDPI Information (2026): Prompt Injection Attacks in LLM Agent Systems: A Comprehensive Review (17/1/54)
arXiv Research (2026): Trust Boundary Collapse in Autonomous Agents arXiv:2601.11893v1
arXiv Research (2026): Semantic Vulnerabilities in RAG and Agentic Frameworks arXiv:2603.22928v1
arXiv Research (2025): Architectural Flaws in Tool-Augmented LLMs arXiv:2503.15547v2
Related Analysis: Tool Injection as the Convergence Layer
Related Analysis: Semantic Execution Layers & Probabilistic Interpreters
Related Analysis: RAG Poisoning & Knowledge Base Manipulation