CVE-2026-7141: vLLM KV Cache Handler RCE

CVE-2026-7141 (CVSS 9.8) is a critical architectural vulnerability in vLLM, one of the most widely used high-throughput, memory-efficient LLM serving engines. The flaw lies in how vLLM’s PagedAttention implementation manages and recycles the blocks of GPU memory (VRAM) that hold the Key-Value (KV) cache.

When a generation request is abruptly canceled or errors out, the system returns the allocated KV cache blocks to the free pool without zeroing the residual tensor data. By exploiting this “Use of Uninitialized Resource” weakness (CWE-908), an attacker can retrieve sensitive prompts from other tenants. Furthermore, by crafting adversarial payloads that corrupt the tensor shape metadata within these recycled blocks, an attacker can trigger an out-of-bounds memory write in the underlying C++/CUDA bindings, escape the Python execution environment, and achieve Remote Code Execution (RCE) on the host machine.

Vulnerable Component: PagedAttention KV Cache

To overcome the memory bottlenecks of LLM inference, vLLM utilizes PagedAttention, which partitions the KV cache into fixed-size blocks (similar to an operating system’s virtual memory pages).
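The paging analogy can be made concrete with a small sketch. The class below is hypothetical and is not vLLM's data structure; the block size of 16 tokens is an assumed value chosen for illustration. It shows how a per-sequence block table could translate a logical token position into a physical block and an offset within it.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block; assumed value for illustration


@dataclass
class BlockTable:
    """Hypothetical per-sequence map from logical blocks to physical block ids."""
    physical_blocks: list = field(default_factory=list)

    def blocks_needed(self, num_tokens: int) -> int:
        # Ceiling division: a partially filled block still occupies a whole block.
        return (num_tokens + BLOCK_SIZE - 1) // BLOCK_SIZE

    def locate(self, token_index: int) -> tuple:
        # Return (physical block id, offset inside that block) for one token.
        logical_block, offset = divmod(token_index, BLOCK_SIZE)
        return self.physical_blocks[logical_block], offset


# Physical blocks do not need to be contiguous, as with OS memory pages.
table = BlockTable(physical_blocks=[7, 2, 9])
print(table.blocks_needed(40))  # 3
print(table.locate(35))         # (9, 3)
```

Because physical blocks are interchangeable, any freed block can be handed to any later sequence, which is why the contents of a recycled block matter.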

The vulnerability lies in the BlockAllocator component. For performance optimization, when a sequence finishes or is aborted, its blocks are freed by simply unlinking them from the active sequence group. The underlying GPU memory (torch.Tensor) is not overwritten with zeros.
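The effect of freeing without zeroing can be reproduced with a toy pool. This is a minimal sketch under stated assumptions: `ToyBlockAllocator` is a hypothetical stand-in that uses host `bytearray` buffers in place of GPU tensors, and it does not reflect vLLM's actual allocator code.

```python
class ToyBlockAllocator:
    """Hypothetical KV block pool; host bytearrays stand in for GPU memory."""

    def __init__(self, num_blocks: int, block_bytes: int, zero_on_free: bool):
        self.storage = [bytearray(block_bytes) for _ in range(num_blocks)]
        self.free_ids = list(range(num_blocks))
        self.zero_on_free = zero_on_free

    def allocate(self) -> int:
        # Hand out the most recently freed block first.
        return self.free_ids.pop()

    def free(self, block_id: int) -> None:
        if self.zero_on_free:
            # Scrub the block before another sequence can receive it.
            self.storage[block_id][:] = bytes(len(self.storage[block_id]))
        self.free_ids.append(block_id)


def reuse_after_abort(zero_on_free: bool) -> bytes:
    pool = ToyBlockAllocator(num_blocks=2, block_bytes=8, zero_on_free=zero_on_free)
    victim = pool.allocate()
    pool.storage[victim][:6] = b"secret"  # tenant A writes KV data
    pool.free(victim)                     # request aborted, block recycled
    reused = pool.allocate()              # tenant B receives the same block
    return bytes(pool.storage[reused])


print(reuse_after_abort(zero_on_free=False))  # b'secret\x00\x00'
print(reuse_after_abort(zero_on_free=True))   # b'\x00\x00\x00\x00\x00\x00\x00\x00'
```

Zeroing on free costs one extra write per recycled block, which is the performance trade-off the unlink-only behavior avoids.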

While reading stale data constitutes a massive privacy breach, achieving RCE requires exploiting the C++ backend (e.g., vllm/csrc/). If an attacker floods the server with precisely sized, malformed tensor payloads and intentionally aborts the connection, they leave “dirty” blocks in VRAM. When the attacker initiates a subsequent, specially crafted request, vLLM allocates these exact dirty blocks. Due to missing boundary validations in the CUDA kernel execution path when processing uninitialized attention keys, the attacker can overwrite adjacent memory regions in the VRAM.

This out-of-bounds write eventually corrupts the Python C-API object structures residing in mapped memory, allowing the attacker to hijack the instruction pointer of the python process hosting the vLLM API server.

The exploitation process requires precise timing but no authentication, making internet-facing vLLM API endpoints highly vulnerable.

  1. Cache Grooming: The attacker sends hundreds of concurrent inference requests containing a highly specific, serialized shellcode payload encoded as text tokens.
  2. Abrupt Termination: Before the model finishes generating, the attacker drops all TCP connections, forcing vLLM to abort the sequences and return the tainted blocks to the BlockAllocator free pool.
  3. Trigger Request: The attacker sends a new, carefully calculated request designed to allocate the exact blocks previously tainted.
  4. Metadata Corruption: The PagedAttention CUDA kernel processes the tainted blocks. The lack of zeroing causes the kernel to misinterpret the attacker’s residual data as valid tensor shape metadata, triggering an out-of-bounds memory write.
  5. Execution: The Python interpreter’s memory space is corrupted, hijacking the execution flow to run the attacker’s shellcode, resulting in a reverse shell as the user running the vLLM service.

Investigating CVE-2026-7141 is uniquely challenging because the core exploit occurs within the GPU’s volatile memory (VRAM), leaving minimal traces on the traditional host filesystem.

  • vLLM Engine Logs: Inspect the application logs for a massive spike in Sequence aborted or Connection dropped warnings, immediately followed by CUDA_ERROR_ILLEGAL_ADDRESS or RuntimeError: CUDA error: an illegal memory access was encountered.
  • System Logs: Look for unexpected segmentation faults (segfault) in the python process, or dmesg entries from the NVIDIA kernel driver (NVRM Xid messages) indicating a GPU fault or reset.
  • Process Monitoring: As with standard Process Lineage Analysis, monitor the vllm Python process. If the AI serving process spawns /bin/sh, curl, or wget, the host has been compromised.
  • VRAM Swap: If GPU VRAM was exhausted during the attack, vLLM may have swapped KV cache blocks out to host RAM, and the OS may in turn have paged that memory to disk. Analysts can use bstrings to carve the Linux swap partition (or pagefile.sys on Windows hosts) for the attacker’s plaintext shellcode or prompt injection payloads.
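The log pattern described in the first bullet, a burst of abort warnings followed by a CUDA memory fault, can be triaged with a short script. This is a sketch only: the message strings are taken from the text above, while the function name, `min_aborts`, and `window` are assumptions that would need tuning against a deployment's normal log volume.

```python
import re

ABORT_RE = re.compile(r"Sequence aborted|Connection dropped")
CUDA_FAULT_RE = re.compile(r"CUDA_ERROR_ILLEGAL_ADDRESS|illegal memory access")


def find_abort_then_fault(lines, min_aborts=50, window=200):
    """Return indexes of CUDA fault lines preceded by a burst of abort warnings.

    min_aborts and window are assumed thresholds, not recommended values.
    """
    hits = []
    abort_positions = []
    for i, line in enumerate(lines):
        if ABORT_RE.search(line):
            abort_positions.append(i)
        elif CUDA_FAULT_RE.search(line):
            # Count abort warnings within the last `window` lines.
            recent = [p for p in abort_positions if i - p <= window]
            if len(recent) >= min_aborts:
                hits.append(i)
    return hits


sample = ["WARNING Sequence aborted"] * 60 + [
    "RuntimeError: CUDA error: an illegal memory access was encountered"
]
print(find_abort_then_fault(sample))  # [60]
```

A hit is a lead for further investigation and not proof of exploitation, since client disconnects and CUDA faults also occur for benign reasons.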

Detection Logic: Monitor for the AI serving engine (Python) unexpectedly spawning interactive shells or network utilities.

title: Suspicious Child Process Spawned by vLLM
id: 8c9d0e1f-2a3b-4c5d-6e7f-8a9b0c1d2e3f
status: experimental
description: Detects the vLLM inference engine spawning suspicious shells, indicating successful RCE via CVE-2026-7141.
logsource:
    category: process_creation
    product: linux
detection:
    selection:
        ParentImage|endswith:
            - '/python'
            - '/python3'
        ParentCommandLine|contains: 'vllm.entrypoints'
        Image|endswith:
            - '/bin/sh'
            - '/bin/bash'
            - '/usr/bin/curl'
            - '/usr/bin/wget'
    condition: selection
level: critical
tags:
    - attack.execution
    - cve.2026-7141

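
Before deploying the rule, its selection logic can be prototyped against sample process events. The function below is a hypothetical helper, not part of any Sigma tooling, and it assumes events arrive as dictionaries keyed by the same field names the rule uses. Unlike Sigma's default matching, this sketch is case-sensitive.

```python
SUSPICIOUS_IMAGES = ("/bin/sh", "/bin/bash", "/usr/bin/curl", "/usr/bin/wget")
PYTHON_PARENTS = ("/python", "/python3")


def matches_rule(event: dict) -> bool:
    """Mirror the rule's selection: all three field conditions must hold."""
    return (
        event.get("ParentImage", "").endswith(PYTHON_PARENTS)
        and "vllm.entrypoints" in event.get("ParentCommandLine", "")
        and event.get("Image", "").endswith(SUSPICIOUS_IMAGES)
    )


shell_event = {
    "ParentImage": "/usr/bin/python3",
    "ParentCommandLine": "python3 -m vllm.entrypoints.openai.api_server",
    "Image": "/bin/sh",
}
versioned_event = dict(shell_event, ParentImage="/usr/bin/python3.11")

print(matches_rule(shell_event))      # True
print(matches_rule(versioned_event))  # False
```

The second event shows a gap worth checking in a real environment: a versioned interpreter path such as /usr/bin/python3.11 does not end with /python3, so the selection as written would not match it.
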
  • Update: Upgrade vLLM to version 0.8.4 or later immediately. This patch implements secure zeroing of KV cache blocks upon sequence abortion and hardens the CUDA kernel boundary checks.
  • Enforce Tenant Isolation: Do not serve highly sensitive models and public-facing endpoints from the same vLLM instance. Implement strict physical or containerized separation to prevent Cross-Tenant Leakage.
  • Container Sandboxing: Run the vLLM Docker container with a read-only root filesystem (--read-only), drop all capabilities (--cap-drop=ALL), and block outbound network access to prevent reverse shells from connecting back to the attacker.
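
The container sandboxing flags above can be assembled programmatically so they are applied consistently. This is a sketch under assumptions: the helper name, the image tag, and the network name are illustrative; `--read-only` and `--cap-drop=ALL` come from the list above, while `--security-opt no-new-privileges`, `--tmpfs /tmp`, and the use of a Docker network created with `docker network create --internal` to block outbound traffic are additions not stated in the advisory.

```python
def hardened_docker_args(image: str, network: str) -> list:
    """Build a `docker run` argv carrying the hardening flags described above."""
    return [
        "docker", "run", "--rm",
        "--read-only",                          # read-only root filesystem
        "--cap-drop=ALL",                       # drop all Linux capabilities
        "--security-opt", "no-new-privileges",  # assumption: extra hardening
        "--tmpfs", "/tmp",                      # assumption: writable scratch space
        "--network", network,                   # assumption: an --internal network
        "--gpus", "all",
        image,
    ]


# Illustrative values; the network is assumed to have been created with
# `docker network create --internal vllm-internal`.
args = hardened_docker_args("vllm/vllm-openai:latest", "vllm-internal")
print(" ".join(args))
```

A read-only root filesystem usually means the model cache must live on a separately mounted volume, and model arguments would be appended after the image name; both are omitted here to keep the sketch short.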

  • SentinelOne: vLLM KV Cache Handler RCE Vulnerability (April 2026)
  • NDSS Symposium (2025): I Know What You Asked: Prompt Leakage via KV-Cache Sharing in Multi-Tenant LLM Serving
  • GitHub Security Advisory: vLLM-2026-7141
  • GPU Memory Forensics & Hunting in the KV Cache