The Attacker in the Mirror: Breaking Self-Consistency in Safety via Anchored Bipolicy Self-Play

📅 2026-05-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of conventional self-play red teaming: when attacker and defender share model parameters, training often settles into trivial self-consistent behaviors, such as uniformly refusing all inputs, that exert no meaningful adversarial pressure. To overcome this, the authors propose Anchored Bipolicy Self-Play, which trains a separate LoRA adapter for each role on top of a frozen base model, enforcing role separation and sustaining adversarial dynamics. The work identifies and mitigates the “self-consistency collapse” induced by parameter sharing, and offers a parameter-efficient, strongly adversarial approach to safety alignment. Evaluated on Qwen2.5-{3B, 7B, 14B}-IT models, the method improves safety robustness with as little as ~1% of the parameters required for full fine-tuning, without compromising reasoning capabilities, and outperforms standard self-play baselines in both attack and defense in cross-play evaluations.
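The "~1% of the parameters" figure can be made concrete with a back-of-the-envelope count. The sketch below is only illustrative: the layer size (4096 x 4096) and the LoRA rank (16) are assumed values chosen for the arithmetic, not the configuration used in the paper, and `lora_param_fraction` is a hypothetical helper name.

```python
def lora_param_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Trainable LoRA parameters as a fraction of one dense d_out x d_in weight."""
    full = d_in * d_out                # parameters updated by full fine-tuning
    adapter = rank * (d_in + d_out)    # A is rank x d_in, B is d_out x rank
    return adapter / full


# Assumed, illustrative dimensions (not taken from the paper).
one_adapter = lora_param_fraction(d_in=4096, d_out=4096, rank=16)

# Two role-specific adapters (attacker and defender) double the adapter cost,
# while the frozen base weight contributes no trainable parameters.
two_adapters = 2 * one_adapter

print(f"one adapter:  {one_adapter:.4%}")
print(f"two adapters: {two_adapters:.4%}")
```

Under these assumptions a single adapter trains under 1% of the layer's parameters, and even two role-specific adapters stay below 2%.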

📝 Abstract

Self-play red teaming is an established approach to improving AI safety in which different instances of the same model play attacker and defender roles in a zero-sum game where the attacker tries to jailbreak the defender. If self-play converges to a Nash equilibrium, the model is guaranteed to respond safely within the settings of the game. Although the parameter sharing enforced by using the same model for both roles improves stability and performance, it introduces fundamental theoretical and architectural limitations. We show that the set of reachable Nash equilibria corresponds to a broad class of behaviours that includes trivial always-refuse strategies and oracle-like defenders, which limits practical applicability. We then show that when attacker and defender share and update the same base model, the dynamics collapse to self-consistency, so that attacks do not exert adversarial pressure on the defender. In response, we propose Anchored Bipolicy Self-Play, which trains distinct role-specific LoRA adapters on top of a frozen base model, thereby maintaining stable optimisation while preserving adversarial pressure through explicit role separation. Relative to standard self-play, we show up to 100x greater parameter efficiency than full fine-tuning and consistent improvements in safety over self-play fine-tuned models. We evaluate on Qwen2.5-{3B, 7B, 14B}-IT models across widely used safety benchmarks, showing improved robustness without loss of reasoning ability. Cross-play experiments further show that our attacker and defender models are superior to their self-play counterparts in terms of adversarial defence and safety.
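The role separation described in the abstract can be sketched structurally as one frozen base weight plus two independent low-rank adapters. This is a minimal toy in pure Python, not the authors' implementation: the class name `AnchoredBipolicy`, the tiny dimensions, and the constant placeholder gradient are all assumptions made for illustration, and only the `B` factor is updated here for brevity (practical LoRA training updates both factors).

```python
import random


def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m), both given as lists of rows."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


class AnchoredBipolicy:
    """Toy model: a frozen base weight with one LoRA-style adapter per role."""

    ROLES = ("attacker", "defender")

    def __init__(self, dim, rank, seed=0):
        rng = random.Random(seed)
        # Frozen anchor shared by both roles; never modified after construction.
        self.base = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]
        # Each role owns its own factors: A (rank x dim) and B (dim x rank).
        # B starts at zero, so every role initially equals the base model.
        self.adapters = {
            role: {
                "A": [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(rank)],
                "B": [[0.0] * rank for _ in range(dim)],
            }
            for role in self.ROLES
        }

    def effective_weight(self, role):
        """Return base + B @ A for the given role."""
        adapter = self.adapters[role]
        delta = matmul(adapter["B"], adapter["A"])
        return [[w + d for w, d in zip(w_row, d_row)]
                for w_row, d_row in zip(self.base, delta)]

    def update(self, role, grad_b, lr=0.1):
        """Gradient step on this role's B factor only; other roles are untouched."""
        b = self.adapters[role]["B"]
        for i, row in enumerate(grad_b):
            for j, g in enumerate(row):
                b[i][j] -= lr * g


model = AnchoredBipolicy(dim=4, rank=2)
base_before = [row[:] for row in model.base]
defender_before = model.effective_weight("defender")
attacker_before = model.effective_weight("attacker")

# Placeholder gradient standing in for a real attacker reward signal.
placeholder_grad = [[1.0] * 2 for _ in range(4)]
model.update("attacker", placeholder_grad)

base_unchanged = model.base == base_before
defender_unchanged = model.effective_weight("defender") == defender_before
attacker_changed = model.effective_weight("attacker") != attacker_before
print(base_unchanged, defender_unchanged, attacker_changed)
```

The point of the sketch is the isolation property: an attacker update changes only the attacker's effective weights, so the defender cannot be pulled along by a shared parameter update, which is the failure mode the abstract attributes to standard self-play.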

Problem

Research questions and friction points this paper is trying to address.

self-play

AI safety

self-consistency

adversarial pressure

Nash equilibrium

Innovation

Methods, ideas, or system contributions that make the work stand out.

Anchored Bipolicy Self-Play

role separation

LoRA adapters