LivePI: More Realistic Benchmarking of Agents Against Indirect Prompt Injectio

📅 2026-05-18
📈 Citations: 0
Influential: 0
📄 PDF

career value

221K/year
🤖 AI Summary
This study addresses the vulnerability of large language model agents to indirect prompt injection (IPI) attacks when processing untrusted inputs from sources such as email, web pages, and chat messages, which can lead to information leakage or malicious actions. The authors introduce LivePI, a controlled testing platform that closely mirrors real-world production environments, encompassing seven input channels, twelve attack strategies, and five malicious objectives. Notably, this work presents the first multi-channel, multi-objective IPI evaluation conducted within an actual virtual machine. To mitigate these threats, the paper proposes a two-layer defense mechanism combining prompt filtering with pre-execution authorization for tool calls. Evaluated on mainstream models including GPT-5.3-Codex, this approach successfully blocks all known IPI attacks—previously achieving success rates of 10.7%–29.6%—without degrading performance on legitimate tasks.
📝 Abstract
AI agents such as OpenClaw are increasingly deployed in local workflows with access to external tools. This creates indirect prompt-injection (IPI) risk: an agent may execute harmful instructions embedded in untrusted inputs such as email, downloaded files, webpages, repositories, or group-chat messages. Existing evaluations are often small, purely simulated, or focused on a narrow set of channels. We introduce LivePI (Live Prompt Injection), a structured benchmark for IPI risk in a production-like but test-controlled environment. LivePI covers seven input surfaces, twelve attack/rendering families, and five malicious goals, including protected-information exfiltration, unauthorized security-control changes, unsafe code retrieval or execution, inbox-summary exfiltration, and cryptocurrency transfer. We run LivePI on a real virtual machine with live but test-controlled email, chat, web, local-file, repository, and wallet interfaces. Across GPT-5.3-Codex, Claude Opus 4.6, Gemini 3.1 Pro, Kimi K2.5, and GLM-5, total attack success rates range from 10.7% to 29.6%. Group-chat injection is uniformly successful across the evaluated backbones in our deployment, and repository-link attacks produce high-severity failures despite a small denominator. We also evaluate a two-layer defense consisting of prompt-level filtering and pre-execution tool-call authorization. In the GPT-5.3-Codex setting, the defense intercepts all tested malicious-goal completions in LivePI before execution while preserving benign utility on PinchBench-derived workloads.
Problem

Research questions and friction points this paper is trying to address.

Indirect Prompt Injection
AI Agents
Benchmarking
Security Risk
Malicious Input
Innovation

Methods, ideas, or system contributions that make the work stand out.

LivePI
indirect prompt injection
AI agent security
structured benchmarking
two-layer defense
💼 Related Jobs
L
Lei Zhao
University of Pennsylvania
A
Abhay Bhaskar
University of Pennsylvania
Edgar Dobriban
Edgar Dobriban
Statistics & Computer Science, University of Pennsylvania
StatisticsMachine LearningAI