DataSentinel: A Game-Theoretic Detection of Prompt Injection Attacks

📅 2025-04-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Adaptive prompt injection attacks against large language models (LLMs) increasingly evade static detectors by strategically crafting inputs, posing a critical security challenge. Method: We propose the first game-theoretic detection framework, formulating prompt injection detection as a min-max adversarial game: an inner loop maximizes attack stealth via gradient-based alternating optimization, while an outer loop minimizes detection error through adversarial fine-tuning. Contribution/Results: Our framework achieves the first generalizable detection capability against *unseen* adaptive attacks—overcoming the fundamental limitation of prior methods that rely on predefined attack patterns. Extensive experiments across multiple LLMs (Llama-3, Qwen2, GPT-4o) and benchmarks (AdvBench, TREND, PIA) demonstrate that our approach significantly outperforms state-of-the-art methods in both detection accuracy and robustness, maintaining high performance against both known and novel adaptive attacks.

Technology Category

Application Category

📝 Abstract
LLM-integrated applications and agents are vulnerable to prompt injection attacks, where an attacker injects prompts into their inputs to induce attacker-desired outputs. A detection method aims to determine whether a given input is contaminated by an injected prompt. However, existing detection methods have limited effectiveness against state-of-the-art attacks, let alone adaptive ones. In this work, we propose DataSentinel, a game-theoretic method to detect prompt injection attacks. Specifically, DataSentinel fine-tunes an LLM to detect inputs contaminated with injected prompts that are strategically adapted to evade detection. We formulate this as a minimax optimization problem, with the objective of fine-tuning the LLM to detect strong adaptive attacks. Furthermore, we propose a gradient-based method to solve the minimax optimization problem by alternating between the inner max and outer min problems. Our evaluation results on multiple benchmark datasets and LLMs show that DataSentinel effectively detects both existing and adaptive prompt injection attacks.
Problem

Research questions and friction points this paper is trying to address.

Detects prompt injection attacks in LLM-integrated applications
Improves detection against adaptive and state-of-the-art attacks
Uses game-theoretic minimax optimization for robust defense
Innovation

Methods, ideas, or system contributions that make the work stand out.

Game-theoretic method for prompt injection detection
Minimax optimization to counter adaptive attacks
Gradient-based alternating optimization solution
🔎 Similar Papers
No similar papers found.