Latent Adversarial Detection: Adaptive Probing of LLM Activations for Multi-Turn Attack Detection

📅 2026-04-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses multi-turn prompt injection attacks, which text-level defenses struggle to detect because each turn appears benign while the turns together form a malicious trajectory. The study shows that such attacks leave a quantifiable “adversarial restlessness” pattern in the residual stream of large language models, and it proposes five scalar activation-trajectory features for conversation-level detection. The approach combines model-specific probes, three-phase turn-level labels (benign/pivoting/adversarial), multi-source training, and leave-one-source-out evaluation. Across four model families ranging from 24B to 70B parameters, combined three-source training achieves 89.4% detection at a 2.4% false positive rate on a held-out mixed set, and detection on real-world LMSYS data reaches 47-71% when its distribution is represented in training. By contrast, training on binary conversation-level labels alone produces 50-59% false positives.
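The leave-one-source-out protocol mentioned above can be sketched as follows. This is only an illustration of how the splits are formed, assuming the three data sources named in the abstract; the helper name `leave_one_source_out` is hypothetical and does not come from the paper.

```python
def leave_one_source_out(sources):
    """Yield (train_sources, held_out_source) pairs, holding out each source once."""
    for held_out in sources:
        train = [s for s in sources if s != held_out]
        yield train, held_out


# Source names taken from the abstract; how each is loaded is not shown here.
SOURCES = ["synthetic", "LMSYS-Chat-1M", "SafeDialBench"]
splits = list(leave_one_source_out(SOURCES))

for train, held_out in splits:
    print(f"train on {train}, evaluate on {held_out}")
```

Each source is held out exactly once, so a detector's score on the held-out source indicates how well it generalizes to an attack distribution it never saw in training.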

📝 Abstract

Multi-turn prompt injection follows a known attack path -- trust-building, pivoting, escalation -- but text-level defenses miss covert attacks where individual turns appear benign. We show this attack path leaves an activation-level signature in the model's residual stream: each phase shift moves the activation, producing a total path length far exceeding benign conversations. We call this adversarial restlessness. Five scalar trajectory features capturing this signal lift conversation-level detection from 76.2% to 93.8% on synthetic held-out data. The signal replicates across four model families (24B-70B); probes are model-specific and do not transfer across architectures. Generalization is source-dependent: leave-one-source-out evaluation shows each of synthetic, LMSYS-Chat-1M, and SafeDialBench captures distinct attack distributions, with detection on real-world LMSYS reaching 47-71% when its distribution is represented in training. Combined three-source training achieves 89.4% detection at 2.4% false positive rate on a held-out mixed set. We further show that three-phase turn-level labels (benign/pivoting/adversarial) unique to our synthetic dataset are essential: binary conversation-level labels produce 50-59% false positives. These results establish adversarial restlessness as a reliable activation-level signal and characterize the data requirements for practical deployment.
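The "total path length" idea can be illustrated with a minimal sketch. It assumes one activation vector per turn and Euclidean distance between consecutive turns; the abstract does not specify the distance metric, the layer probed, or the other four trajectory features, so the function name `path_length` and the toy 2-D vectors below are purely hypothetical.

```python
import math


def path_length(turn_activations):
    """Sum of Euclidean distances between consecutive per-turn activation vectors."""
    total = 0.0
    for prev, curr in zip(turn_activations, turn_activations[1:]):
        total += math.dist(prev, curr)
    return total


# Toy 2-D vectors standing in for residual-stream activations (made-up values).
benign = [(0.0, 0.0), (0.1, 0.0), (0.1, 0.1)]   # small drift between turns
attack = [(0.0, 0.0), (3.0, 4.0), (0.0, 0.0)]   # large jumps at each phase shift

print(path_length(benign))
print(path_length(attack))
```

In this toy case the attack trajectory starts and ends at the same point, yet its path length is far larger than the benign one, which is why a path-based feature can separate conversations that a start-to-end comparison would not.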

Problem

Research questions and friction points this paper is trying to address.

multi-turn prompt injection

adversarial detection

LLM activations

covert attacks

adversarial restlessness

Innovation

Methods, ideas, or system contributions that make the work stand out.

adversarial restlessness

activation trajectory

multi-turn prompt injection