Building Better Deception Probes Using Targeted Instruction Pairs

📅 2026-02-01
🤖 AI Summary
This work addresses the vulnerability of existing linear probes to spurious correlations when detecting AI deception, which often leads to false positives on non-deceptive responses. To mitigate this, the authors propose a targeted instruction-pair design method grounded in a taxonomy of deceptive behaviors. By constructing interpretable instruction pairs that isolate specific deception types, the approach trains linear probes to focus on deceptive intent rather than superficial content patterns. Experimental results demonstrate that instruction selection is the dominant factor in probe performance, accounting for 70.6% of variance. Probes tailored to specific threat models significantly outperform general-purpose detectors, achieving higher detection accuracy and lower false-positive rates on evaluation datasets. This study underscores the importance of aligning probes with concrete deception categories, and concludes that organizations should design specialized probes matched to their own threat models rather than relying on a universal deception detector.


📝 Abstract
Linear probes are a promising approach for monitoring AI systems for deceptive behaviour. Previous work has shown that a linear classifier trained on a contrastive instruction pair and a simple dataset can achieve good performance. However, these probes exhibit notable failures even in straightforward scenarios, including spurious correlations and false positives on non-deceptive responses. In this paper, we identify the importance of the instruction pair used during training. Furthermore, we show that targeting specific deceptive behaviors through a human-interpretable taxonomy of deception leads to improved results on evaluation datasets. Our findings reveal that instruction pairs capture deceptive intent rather than content-specific patterns, explaining why prompt choice dominates probe performance (70.6% of variance). Given the heterogeneity of deception types across datasets, we conclude that organizations should design specialized probes targeting their specific threat models rather than seeking a universal deception detector.
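The abstract describes training a linear classifier on model activations elicited by a contrastive instruction pair (e.g. an honesty-framed vs. a deception-framed instruction). The sketch below illustrates that setup in miniature; it is an assumption-laden toy, not the paper's pipeline: real activations would come from a language model's hidden states, whereas here they are simulated as Gaussian vectors separated along a hypothetical "deception direction".

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 64    # hypothetical hidden-state dimension (assumption, for illustration)
n = 200   # samples per class

# Hypothetical "deception direction": in the paper's setting, activations are
# collected from a model prompted with each side of a contrastive instruction
# pair; here we simulate the two conditions as Gaussians shifted along one
# unit direction.
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)

honest_acts = rng.normal(size=(n, d))
deceptive_acts = rng.normal(size=(n, d)) + 3.0 * direction

X = np.vstack([honest_acts, deceptive_acts])
y = np.array([0] * n + [1] * n)  # 1 = deception-framed condition

# The "linear probe": a logistic-regression classifier on the activations.
probe = LogisticRegression(max_iter=1000).fit(X, y)
acc = probe.score(X, y)

# The learned weight vector should roughly recover the injected direction.
w = probe.coef_[0]
cosine = float(w @ direction / np.linalg.norm(w))
```

The interesting quantity is `w`, the probe's weight vector: the paper's claim that instruction choice dominates performance amounts to saying that which contrastive pair generates the activations matters far more than the downstream classifier details.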
Problem

Research questions and friction points this paper is trying to address.

deception detection
linear probes
AI safety
spurious correlations
false positives
Innovation

Methods, ideas, or system contributions that make the work stand out.

targeted instruction pairs
deception probes
linear probing
deceptive intent
taxonomy of deception
Vikram Natarajan
LASR Labs
Devina Jain
LASR Labs
Shivam Arora
LASR Labs
Satvik Golechha
Research Scientist, AISI
AGI security, alignment, interpretability, reinforcement learning
Joseph Bloom
UK AI Security Institute