Attention Tracker: Detecting Prompt Injection Attacks in LLMs

📅 2024-11-01
🏛️ arXiv.org
📈 Citations: 4
Influential: 0
🤖 AI Summary
Prompt injection attacks against large language models (LLMs) are stealthy and difficult to detect. Method: This paper proposes a real-time detection method that requires no fine-tuning, gradient computation, or additional LLM inference. Its core contribution is the first formal characterization of the "distraction effect": an anomalous attention shift, in specific attention heads, away from the original instruction and toward the injected instruction. Building on this, the authors introduce Attention Tracker, a lightweight, training-free, model-agnostic framework that detects attacks by analyzing multi-head attention patterns, identifying the critical "important heads," and tracking the attention those heads place on the instruction region. Results: Evaluated across multiple LLMs, attack types, and datasets, Attention Tracker improves AUROC by up to 10.0% over existing methods and remains effective even on small-scale LLMs, demonstrating strong generalizability and deployment efficiency.

📝 Abstract
Large Language Models (LLMs) have revolutionized various domains but remain vulnerable to prompt injection attacks, where malicious inputs manipulate the model into ignoring original instructions and executing designated actions. In this paper, we investigate the underlying mechanisms of these attacks by analyzing the attention patterns within LLMs. We introduce the concept of the distraction effect, where specific attention heads, termed important heads, shift focus from the original instruction to the injected instruction. Building on this discovery, we propose Attention Tracker, a training-free detection method that tracks attention patterns on the instruction to detect prompt injection attacks without the need for additional LLM inference. Our method generalizes effectively across diverse models, datasets, and attack types, showing an AUROC improvement of up to 10.0% over existing methods, and performs well even on small LLMs. We demonstrate the robustness of our approach through extensive evaluations and provide insights into safeguarding LLM-integrated systems from prompt injection vulnerabilities.
Problem

Research questions and friction points this paper is trying to address.

Prompt injection attacks manipulate LLMs into ignoring their original instructions, and are stealthy and hard to detect
The attention-level mechanism by which injected prompts hijack model behavior is poorly understood
Existing defenses that require fine-tuning or additional LLM inference are costly to deploy in real time
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzes multi-head attention patterns in LLMs to identify injection-sensitive "important heads"
Tracks the distraction effect, where attention shifts from the original instruction to the injected one
Training-free detection of prompt injection that needs no additional LLM inference
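The detection idea above can be sketched in a few lines: measure how much attention the chosen heads place on the original instruction span, and flag an input when that mass drops (the distraction effect). This is a minimal illustration, not the paper's implementation; the tensor shape, the head-selection step, and the threshold are all simplifying assumptions.

```python
import numpy as np

def focus_score(attn, instr_span, important_heads):
    """Mean attention mass that the final token position places on the
    instruction span, averaged over the selected heads.

    attn: array of shape (num_layers, num_heads, seq_len, seq_len),
          where each row is a softmax-normalized attention distribution.
    instr_span: (start, end) token indices of the original instruction.
    important_heads: iterable of (layer, head) index pairs.
    """
    start, end = instr_span
    scores = [attn[l, h, -1, start:end].sum() for l, h in important_heads]
    return float(np.mean(scores))

def detect_injection(attn, instr_span, important_heads, threshold=0.5):
    # A low focus score suggests attention has been "distracted" away
    # from the instruction region, flagging a likely injection.
    # The threshold here is illustrative, not from the paper.
    return focus_score(attn, instr_span, important_heads) < threshold
```

In practice the attention tensor would come from a single forward pass (e.g. a transformers model called with `output_attentions=True`), and the important heads would be selected offline by comparing attention patterns on clean versus injected prompts, as the paper's head-identification step describes.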