🤖 AI Summary
This work investigates the failure modes of activation probing under realistic black-box adversarial settings and proposes an efficient detection methodology. Existing evaluations rely on white-box assumptions that obscure real-world deployment vulnerabilities; to address this limitation, we introduce the first prompt-driven, lightweight red-teaming framework. Built on zero-shot large language models, it combines iterative feedback, in-context learning, and scenario-constrained attacks, requiring neither fine-tuning nor gradient access for black-box robustness assessment. Our approach systematically surfaces interpretable false positives (e.g., legal terminology triggering spurious activations) and false negatives (e.g., bland procedural phrasing suppressing probe activations). Crucially, it consistently identifies critical vulnerabilities in state-of-the-art probes during high-stakes interactions, empirically validating its effectiveness and practical utility for pre-deployment failure prediction.
📝 Abstract
Activation probes are attractive monitors for AI systems due to their low cost and latency, but their real-world robustness remains underexplored. We ask: what failure modes arise under realistic, black-box adversarial pressure, and how can we surface them with minimal effort? We present a lightweight black-box red-teaming procedure that wraps an off-the-shelf LLM with iterative feedback and in-context learning (ICL), requiring no fine-tuning, gradients, or architectural access. In a case study on probes for high-stakes interactions, we show that our approach surfaces actionable weaknesses in a state-of-the-art (SOTA) probe. Our analysis uncovers interpretable brittleness patterns (e.g., false positives induced by legalese and false negatives caused by bland procedural tone) and reduced but persistent vulnerabilities under scenario-constrained attacks. These results suggest that simple prompted red-teaming scaffolding can anticipate failure patterns before deployment and may yield actionable insights to harden future probes.
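The red-teaming loop described in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `query_llm` and `probe_score` are hypothetical stand-ins (in practice, the former would call an off-the-shelf LLM API with the accumulated in-context examples, and the latter would query the deployed activation probe as a black box, returning a monitor score in [0, 1]).

```python
# Hedged sketch of the prompted black-box red-teaming loop: generate a
# candidate attack with an LLM, score it against the probe, and feed
# successful attempts back as in-context examples. All names are
# illustrative assumptions, not the authors' actual interface.

def query_llm(instruction, icl_examples):
    """Stub attacker LLM: mutates the best prior attempt (placeholder logic)."""
    base = icl_examples[-1][0] if icl_examples else "benign request"
    return base + " [reworded]"

def probe_score(prompt):
    """Stub black-box probe: here, longer prompts score lower (placeholder)."""
    return max(0.0, 1.0 - 0.05 * len(prompt.split()))

def red_team(seed_prompt, n_iters=10, threshold=0.5):
    """Iterative-feedback loop: keep only candidates that lower the probe
    score, reusing them as ICL examples; stop once the probe is evaded."""
    icl_examples = [(seed_prompt, probe_score(seed_prompt))]
    for _ in range(n_iters):
        candidate = query_llm("rewrite to evade the monitor", icl_examples)
        score = probe_score(candidate)
        if score < icl_examples[-1][1]:   # feedback: keep improvements only
            icl_examples.append((candidate, score))
        if score < threshold:             # probe no longer flags the prompt
            break
    return icl_examples[-1]

best_prompt, best_score = red_team("describe a high-stakes scenario")
```

The key property mirrored here is that the loop needs only the probe's scalar output, never its gradients or internals, which is what makes the procedure black-box and cheap to run pre-deployment.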