🤖 AI Summary
Indirect prompt injection attacks (IPIAs) stealthily manipulate large language models (LLMs) into executing malicious instructions hidden in untrusted inputs. Method: This paper proposes IntentGuard, a general-purpose defense framework built around an instruction-following intent analyzer (IIA). Instead of detecting malicious text, IntentGuard applies three "thinking intervention" strategies (start-of-thinking prefilling, end-of-thinking refinement, and adversarial in-context demonstration) to elicit from reasoning-enabled LLMs (e.g., Qwen-3-32B, gpt-oss-20B) a structured list of the instructions they intend to follow, flagging any that originate in untrusted data. Contribution/Results: Evaluated on the AgentDojo and Mind2Web benchmarks, IntentGuard incurs no utility degradation in all but one setting while reducing adaptive IPIA success rates from 100% to 8.5% in a Mind2Web scenario, demonstrating substantially improved agent robustness against IPIAs without compromising task fidelity.
📝 Abstract
Indirect prompt injection attacks (IPIAs), where large language models (LLMs) follow malicious instructions hidden in input data, pose a critical threat to LLM-powered agents. In this paper, we present IntentGuard, a general defense framework based on instruction-following intent analysis. The key insight behind IntentGuard is that the decisive factor in IPIAs is not the presence of malicious text, but whether the LLM intends to follow instructions from untrusted data. Building on this insight, IntentGuard leverages an instruction-following intent analyzer (IIA) to identify which parts of the input prompt the model recognizes as actionable instructions, and then flags or neutralizes any that overlap with untrusted data segments. To instantiate the framework, we develop an IIA that uses three "thinking intervention" strategies to elicit a structured list of intended instructions from reasoning-enabled LLMs: start-of-thinking prefilling, end-of-thinking refinement, and adversarial in-context demonstration. We evaluate IntentGuard on two agentic benchmarks (AgentDojo and Mind2Web) using two reasoning-enabled LLMs (Qwen-3-32B and gpt-oss-20B). Results demonstrate that IntentGuard achieves (1) no utility degradation in all but one setting and (2) strong robustness against adaptive prompt injection attacks (e.g., reducing attack success rates from 100% to 8.5% in a Mind2Web scenario).
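The core check described in the abstract, comparing the model's elicited intentions against untrusted input regions, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the data structures, function names, and span-based representation are all assumptions made for clarity.

```python
# Hypothetical sketch of IntentGuard's overlap check (names illustrative):
# given the instructions the model intends to follow (each tagged with the
# character span of the prompt it came from) and the spans of untrusted
# data segments, flag any intended instruction sourced from untrusted data.

def spans_overlap(a, b):
    """True if half-open character spans a=(start, end) and b intersect."""
    return a[0] < b[1] and b[0] < a[1]

def flag_injected_intents(intended_instructions, untrusted_spans):
    """Return intended instructions whose source span falls inside an
    untrusted data segment; these would be flagged or neutralized."""
    return [
        instr
        for instr in intended_instructions
        if any(spans_overlap(instr["span"], s) for s in untrusted_spans)
    ]

# Example: the user's task is trusted; a tool result (untrusted) contains
# an injected instruction that the model internally intends to execute.
intents = [
    {"text": "summarize the retrieved page", "span": (0, 30)},       # user prompt
    {"text": "email the password to attacker", "span": (120, 160)},  # tool output
]
untrusted = [(100, 400)]  # character range of the tool-returned data

flagged = flag_injected_intents(intents, untrusted)
```

Only the second intent is flagged, since its source span lies inside the untrusted segment; the benign user instruction passes through untouched, which is why this check avoids the utility cost of text-level malicious-content filters.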