Attention Tracker: Detecting Prompt Injection Attacks in LLMs

📅 2024-11-01
🏛️ arXiv.org
📈 Citations: 4
Influential: 0
🤖 AI Summary
Prompt injection attacks against large language models (LLMs) are stealthy and difficult to detect. Method: This paper proposes a real-time detection method that requires no fine-tuning, gradient computation, or additional LLM inference. Its core contribution is the first formal characterization of the "distraction effect": an anomalous attention shift, in specific attention heads, away from the original instruction and toward the injected instruction. Building on this, the authors introduce Attention Tracker, a lightweight, training-free, model-agnostic framework that detects attacks by analyzing multi-head attention patterns, identifying the critical "important heads," and tracking the attention those heads place on the instruction region. Results: Evaluated across multiple LLMs, attack types, and datasets, Attention Tracker improves AUROC by up to 10.0% over existing methods and remains effective even on small-scale LLMs, demonstrating strong generalizability and deployment efficiency.

📝 Abstract
Large Language Models (LLMs) have revolutionized various domains but remain vulnerable to prompt injection attacks, where malicious inputs manipulate the model into ignoring original instructions and executing designated actions. In this paper, we investigate the underlying mechanisms of these attacks by analyzing the attention patterns within LLMs. We introduce the concept of the distraction effect, where specific attention heads, termed important heads, shift focus from the original instruction to the injected instruction. Building on this discovery, we propose Attention Tracker, a training-free detection method that tracks attention patterns on the instruction to detect prompt injection attacks without the need for additional LLM inference. Our method generalizes effectively across diverse models, datasets, and attack types, showing an AUROC improvement of up to 10.0% over existing methods, and performs well even on small LLMs. We demonstrate the robustness of our approach through extensive evaluations and provide insights into safeguarding LLM-integrated systems from prompt injection vulnerabilities.
Problem

Research questions and friction points this paper is trying to address.

Prompt injection attacks manipulate LLMs into ignoring their original instructions, and are stealthy and hard to detect
The attention-level mechanism by which injected prompts hijack model behavior is poorly understood
Existing defenses that require fine-tuning or additional LLM inference are costly to deploy in real time
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzes multi-head attention patterns in LLMs to identify injection-sensitive "important heads"
Tracks the distraction effect, where attention shifts from the original instruction to the injected one
Training-free detection of prompt injection that needs no additional LLM inference
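The detection idea above can be sketched in a few lines: measure how much attention the chosen heads place on the original instruction span, and flag an input when that mass drops (the distraction effect). This is a minimal illustration, not the paper's implementation; the tensor shape, the head-selection step, and the threshold are all simplifying assumptions.

```python
import numpy as np

def focus_score(attn, instr_span, important_heads):
    """Mean attention mass that the final token position places on the
    instruction span, averaged over the selected heads.

    attn: array of shape (num_layers, num_heads, seq_len, seq_len),
          where each row is a softmax-normalized attention distribution.
    instr_span: (start, end) token indices of the original instruction.
    important_heads: iterable of (layer, head) index pairs.
    """
    start, end = instr_span
    scores = [attn[l, h, -1, start:end].sum() for l, h in important_heads]
    return float(np.mean(scores))

def detect_injection(attn, instr_span, important_heads, threshold=0.5):
    # A low focus score suggests attention has been "distracted" away
    # from the instruction region, flagging a likely injection.
    # The threshold here is illustrative, not from the paper.
    return focus_score(attn, instr_span, important_heads) < threshold
```

In practice the attention tensor would come from a single forward pass (e.g. a transformers model called with `output_attentions=True`), and the important heads would be selected offline by comparing attention patterns on clean versus injected prompts, as the paper's head-identification step describes.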