Disentangling Intent from Role: Adversarial Self-Play for Persona-Invariant Safety Alignment

📅 2026-05-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current large language models remain vulnerable to persona-based (role-play) jailbreak attacks, and existing defenses struggle to balance safety against general capability. This work proposes the Persona-Invariant Alignment (PIA) framework, which rests on a structural disentanglement hypothesis: safety-aligned intent can be decoupled from persona identity, here via a unidirectional KL divergence constraint. PIA further sets up an adversarial self-play mechanism in which attack and defense strategies co-evolve: the attacker explores high-risk persona spaces through Persona Lineage Evolution (PLE), while the defender enforces persona-invariant safety decisions through Persona-Invariant Consistency Learning (PICL). In the reported experiments, PIA substantially reduces the success rate of persona-based jailbreak attacks without compromising the model’s general performance.
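As a rough illustration of what a unidirectional KL divergence constraint could look like, the sketch below penalizes a persona-conditioned decision distribution for drifting away from a persona-free reference, and not the other way around. Everything here is an assumption made for illustration: the two-way refuse/comply decision, the function names, and the choice of which distribution serves as the fixed reference are hypothetical and are not taken from the paper or its repository.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def one_sided_consistency_loss(neutral_logits, persona_logits):
    """Hypothetical one-directional consistency penalty.

    Assumption: the persona-free distribution is a fixed reference (in a
    training framework its gradient would be stopped), so only the
    persona-conditioned distribution is pulled toward it.
    """
    reference = softmax(neutral_logits)   # decision without any persona
    conditioned = softmax(persona_logits)  # decision under a persona prompt
    return kl_divergence(reference, conditioned)


# Logits are ordered as [refuse, comply] for a harmful request (assumed layout).
same = one_sided_consistency_loss([2.0, 0.0], [2.0, 0.0])
flipped = one_sided_consistency_loss([2.0, 0.0], [0.0, 2.0])
print(f"persona leaves decision unchanged: {same:.4f}")
print(f"persona flips the decision:        {flipped:.4f}")
```

The penalty is zero when the persona leaves the safety decision unchanged and grows as the persona shifts probability mass toward compliance, which is the behavior a persona-invariant objective would aim to discourage.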

📝 Abstract

The growing capabilities of large language models (LLMs) have driven their widespread deployment across diverse domains, including potentially high-risk scenarios. Despite advances in safety alignment techniques, current models remain vulnerable to emerging persona-based jailbreak attacks. Existing research on persona-based jailbreaks has focused primarily on iterating attacks and lacks systematic, mechanistic constraints on the defense side. To address this challenge, we propose Persona-Invariant Alignment (PIA), an adversarial self-play framework that achieves co-evolution through Persona Lineage Evolution (PLE) on the attack side and Persona-Invariant Consistency Learning (PICL) on the defense side. Theoretically, PICL is grounded in the structural separation hypothesis and uses a unilateral KL-divergence constraint to structurally decouple safety decisions from persona context, thereby maintaining safe behavior under persona-based jailbreak attacks. Experimental results demonstrate that PLE efficiently explores high-risk persona spaces by leveraging lineage-based credit propagation. Meanwhile, the PICL defense significantly reduces the Attack Success Rate (ASR) while preserving the model's general capability, supporting the robustness of this alignment paradigm. Code is available at https://github.com/JiajiaLi-1130/PIA.
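The abstract does not spell out how lineage-based credit propagation works. One plausible reading, sketched below purely as an assumption, is that each evolved persona records its parent, and when a persona produces a successful attack part of that reward flows back to its ancestors so that productive lineages are favored in later exploration. The `PersonaNode` class, the `propagate_credit` function, and the geometric `decay` factor are all hypothetical names and choices, not the paper's actual PLE algorithm.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PersonaNode:
    """A persona in a hypothetical evolution tree, linked to the persona it was derived from."""
    name: str
    parent: Optional["PersonaNode"] = None
    credit: float = 0.0


def propagate_credit(node: PersonaNode, reward: float, decay: float = 0.5) -> None:
    """Give `reward` to `node`, then a geometrically decayed share to each ancestor.

    Assumption: geometric decay is an illustrative choice; the source does
    not state the actual propagation rule.
    """
    current: Optional[PersonaNode] = node
    share = reward
    while current is not None:
        current.credit += share
        share *= decay
        current = current.parent


# Build a three-generation lineage and reward the most recent persona.
root = PersonaNode("seed persona")
child = PersonaNode("first mutation", parent=root)
grandchild = PersonaNode("second mutation", parent=child)

propagate_credit(grandchild, reward=1.0)

for persona in (grandchild, child, root):
    print(f"{persona.name}: {persona.credit}")
```

With a reward of 1.0 and a decay of 0.5, the rewarded persona receives 1.0, its parent 0.5, and the root 0.25. An attacker-side sampler could then weight lineages by accumulated credit when choosing which personas to mutate next.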

Problem

Research questions and friction points this paper is trying to address.

persona-based jailbreak

safety alignment

large language models

adversarial robustness

role-invariant safety

Innovation

Methods, ideas, or system contributions that make the work stand out.

Persona-Invariant Alignment

Adversarial Self-Play

Structural Disentanglement