🤖 AI Summary
This work addresses the vulnerability of existing linear probes to spurious correlations when detecting AI deception, which often leads to false positives on non-deceptive responses. To mitigate this, the authors propose a targeted instruction-pair design method grounded in a taxonomy of deceptive behaviors. By constructing interpretable instruction pairs that isolate specific deception types, the approach trains linear probes to focus on deceptive intent rather than superficial content patterns. Experimental results demonstrate that instruction selection is the dominant factor in probe performance, accounting for 70.6% of variance. Probes tailored to specific threat models significantly outperform general-purpose detectors, achieving higher detection accuracy and lower false-positive rates on evaluation datasets. This study underscores the importance of aligning probing mechanisms with concrete deception categories and establishes a new paradigm for interpretable AI safety evaluation.
📝 Abstract
Linear probes are a promising approach for monitoring AI systems for deceptive behaviour. Previous work has shown that a linear classifier trained on a contrastive instruction pair and a simple dataset can achieve good performance. However, these probes exhibit notable failures even in straightforward scenarios, including spurious correlations and false positives on non-deceptive responses. In this paper, we identify the importance of the instruction pair used during training. Furthermore, we show that targeting specific deceptive behaviors through a human-interpretable taxonomy of deception leads to improved results on evaluation datasets. Our findings reveal that instruction pairs capture deceptive intent rather than content-specific patterns, explaining why prompt choice dominates probe performance (70.6% of variance). Given the heterogeneity of deception types across datasets, we conclude that organizations should design specialized probes targeting their specific threat models rather than seeking a universal deception detector.