🤖 AI Summary
Existing streaming methods for detecting harmful intent in large language models are prone to false positives because they rely on isolated sensitive tokens, which undermines reliability in high-stakes scenarios such as CBRN threats. This work proposes a segment-level, consistency-based streaming detection mechanism that requires multiple evidence tokens to jointly support a prediction rather than depending on a single high-scoring token, which makes detection more robust. Analysis of internal activations shows that features from attention or MLP layers outperform residual-stream features. Probes developed for the base LLMs also transfer plug-and-play to attacks that use character-level ciphers enabled by adversarial fine-tuning, reaching an AUROC above 98.85%. At a fixed 1% false positive rate, the true positive rate improves by 35.55% relative to strong streaming baselines, and AUROC improves substantially even though the baseline is already near saturation at 97.40%.
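To make the headline metric concrete, here is a minimal sketch of how a true positive rate at a fixed false positive rate can be computed from prompt-level scores. The function name `tpr_at_fpr` and the toy scores are illustrative assumptions, not taken from the paper, and the paper's exact evaluation protocol may differ.

```python
import numpy as np

def tpr_at_fpr(benign_scores, harmful_scores, target_fpr=0.01):
    """Return (TPR, threshold) with the threshold chosen on benign scores.

    Hypothetical helper: a prompt is flagged when its score is strictly
    above the threshold, and the threshold is the largest benign score
    that keeps the benign flag rate at or below target_fpr.
    """
    benign = np.sort(np.asarray(benign_scores, dtype=float))
    harmful = np.asarray(harmful_scores, dtype=float)
    # Number of benign prompts we are allowed to flag.
    allowed = int(np.floor(target_fpr * benign.size))
    # Everything strictly above this value is flagged.
    threshold = benign[benign.size - allowed - 1]
    tpr = float(np.mean(harmful > threshold))
    return tpr, float(threshold)

# Toy data (assumed): 100 benign scores and 4 harmful scores.
toy_benign = np.arange(100) / 100.0
toy_harmful = [0.5, 0.985, 0.995, 1.5]
toy_tpr, toy_threshold = tpr_at_fpr(toy_benign, toy_harmful, target_fpr=0.01)
```

Fixing the false positive rate first and then reading off the true positive rate reflects the deployment constraint: the budget for false alarms on benign traffic is set in advance, and detectors are compared by how many attacks they catch within that budget.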
📝 Abstract
Large Language Models (LLMs) are increasingly exposed to adaptive jailbreaking, particularly in high-stakes Chemical, Biological, Radiological, and Nuclear (CBRN) domains. Although streaming probes enable real-time monitoring, they still make systematic errors. We identify a core issue: existing methods often rely on a few high-scoring tokens, leading to false alarms when sensitive CBRN terms appear in benign contexts. To address this, we introduce a streaming probing objective that requires multiple evidence tokens to consistently support a prediction, rather than relying on isolated spikes. This encourages more robust detection based on aggregated signals instead of single-token cues. At a fixed 1% false-positive rate, our method improves the true-positive rate by 35.55% relative to strong streaming baselines. We further observe substantial gains in AUROC, even when starting from near-saturated baseline performance (AUROC = 97.40%). We also show that probing attention or MLP activations consistently outperforms residual-stream features. Finally, even when adversarial fine-tuning enables novel character-level ciphers, harmful intent remains detectable: probes developed for the base LLMs can be applied "plug-and-play" to these obfuscated attacks, achieving an AUROC of over 98.85%.
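The contrast between isolated spikes and aggregated evidence can be illustrated with a small sketch. The abstract does not specify the probing objective, so the top-k mean below is only an assumed stand-in for "multiple evidence tokens", applied at inference time to hypothetical per-token probe scores. It is not the authors' training objective.

```python
import numpy as np

def max_score(token_scores):
    """Single-token aggregation: the prompt score is the highest token score."""
    return float(np.max(np.asarray(token_scores, dtype=float)))

def topk_mean_score(token_scores, k=4):
    """Multi-evidence aggregation (assumed): mean of the k highest token scores."""
    scores = np.asarray(token_scores, dtype=float)
    k = min(k, scores.size)
    # np.partition places the k largest values in the last k positions.
    topk = np.partition(scores, scores.size - k)[scores.size - k:]
    return float(topk.mean())

def streaming_scores(token_scores, k=4):
    """Score every prefix, as a streaming monitor would after each new token."""
    return [topk_mean_score(token_scores[: t + 1], k) for t in range(len(token_scores))]

# Hypothetical per-token probe outputs in [0, 1].
benign_tokens = [0.05, 0.10, 0.95, 0.08, 0.04, 0.06]   # one sensitive term, benign context
harmful_tokens = [0.20, 0.70, 0.85, 0.80, 0.75, 0.30]  # sustained evidence across tokens

benign_max, harmful_max = max_score(benign_tokens), max_score(harmful_tokens)
benign_topk, harmful_topk = topk_mean_score(benign_tokens), topk_mean_score(harmful_tokens)
```

On these toy scores, max aggregation ranks the benign prompt above the harmful one because of its single spike, while the top-k mean ranks the harmful prompt higher because several of its tokens agree. The prefix scores early in a stream are still dominated by whatever few tokens have arrived, so a practical monitor would need a warm-up or minimum-evidence rule, which this sketch leaves out.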