TriPlay-RL: Tri-Role Self-Play Reinforcement Learning for LLM Safety Alignment

πŸ“… 2026-01-26
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work proposes a three-agent closed-loop reinforcement learning framework to mitigate the risk of large language models generating toxic or harmful content. By orchestrating a self-play process among an attacker, a defender, and an evaluator, the approach achieves continuous safety alignment with minimal human annotation. The framework unifies adversarial prompt generation, defensive response generation, and fine-grained safety evaluation within a co-evolutionary learning loop. Experimental results demonstrate that the attacker improves adversarial effectiveness by 20%–50% while maintaining diversity; the defender enhances safety performance by 10%–30% without compromising reasoning capabilities; and the evaluator steadily increases its discriminative accuracy in distinguishing unsafe responses, simplistic refusals, and helpful guidance.
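
The closed loop the summary describes (attack, defend, judge, update) can be made concrete as a single training round. The sketch below is a minimal illustration under assumed interfaces; the role objects and their `generate`, `judge`, and `update` methods are hypothetical names for this sketch, not the paper's released code:

```python
# Minimal sketch of one tri-role self-play round in the spirit of TriPlay-RL.
# All role objects and methods here are assumed interfaces for illustration.

def tri_play_round(attacker, defender, evaluator, seed_prompts):
    """Run attack -> defense -> evaluation per seed, then update every role."""
    transcripts = []
    for seed in seed_prompts:
        adv_prompt = attacker.generate(seed)       # adversarial prompt
        response = defender.generate(adv_prompt)   # candidate safe response
        # Fine-grained three-way verdict: "unsafe", "refusal", or "helpful".
        verdict = evaluator.judge(adv_prompt, response)
        transcripts.append((adv_prompt, response, verdict))

    # Co-evolution step: each role learns from the same transcripts.
    # The attacker is reinforced for eliciting unsafe output, the defender
    # for staying both safe and helpful, and the evaluator is refined against
    # whatever trusted labels are available (near-zero manual annotation).
    attacker.update(transcripts)
    defender.update(transcripts)
    evaluator.update(transcripts)
    return transcripts
```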

πŸ“ Abstract
In recent years, safety risks associated with large language models have become increasingly prominent, highlighting the urgent need to mitigate the generation of toxic and harmful content. The mainstream paradigm for LLM safety alignment typically adopts a collaborative framework involving three roles: an attacker for adversarial prompt generation, a defender for safety defense, and an evaluator for response assessment. In this paper, we propose a closed-loop reinforcement learning framework, TriPlay-RL, that enables iterative, co-improving collaboration among the three roles with near-zero manual annotation. Experimental results show that the attacker preserves high output diversity while achieving a 20%–50% improvement in adversarial effectiveness; the defender attains 10%–30% gains in safety performance without degrading general reasoning capability; and the evaluator continuously refines its fine-grained judgment through iterations, accurately distinguishing unsafe responses, simple refusals, and useful guidance. Overall, our framework establishes an efficient and scalable paradigm for LLM safety alignment, enabling continuous co-evolution within a unified learning loop.
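
One way to read the abstract's three-way distinction (unsafe response, simple refusal, useful guidance) is as a reward-shaping table for the two policy roles. The scalar values below are illustrative assumptions for this sketch, not numbers reported in the paper:

```python
# Hypothetical reward shaping over the evaluator's three-way verdict.
# The scalar values are assumptions for illustration, not the paper's settings.

ATTACKER_REWARD = {
    "unsafe": 1.0,    # attack succeeded: defender emitted harmful content
    "refusal": 0.0,   # blocked, but only by an uninformative refusal
    "helpful": -0.2,  # prompt was deflected into safe, useful guidance
}

DEFENDER_REWARD = {
    "unsafe": -1.0,   # harmful content: strongly penalized
    "refusal": 0.2,   # safe but unhelpful ("simple refusal")
    "helpful": 1.0,   # safe and useful: the desired behavior
}

def role_rewards(verdict: str) -> tuple[float, float]:
    """Map one evaluator verdict to (attacker_reward, defender_reward)."""
    return ATTACKER_REWARD[verdict], DEFENDER_REWARD[verdict]
```

Shaping of this form keeps the attacker/defender pressure adversarial while still rewarding the defender for helpfulness rather than blanket refusal, which is consistent with the claimed safety gains coming without a loss of general capability.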
Problem

Research questions and friction points this paper is trying to address.

LLM safety
harmful content
safety alignment
toxic output
adversarial prompts
Innovation

Methods, ideas, or system contributions that make the work stand out.

TriPlay-RL
self-play reinforcement learning
LLM safety alignment
tri-role collaboration
adversarial training
πŸ‘₯ Authors
Zhewen Tan
Peking University
Wenhan Yu
Peking University
Jianfeng Si
Qiyuan Tech
Tongxin Liu
Peking University
Kaiqi Guan
Peking University
Huiyan Jin
Peking University
J. Tao
Peking University
Xiaokun Yuan
Peking University
Duohe Ma
Associate Professor
Moving Target Defense, Information Security, Network Security, Cloud Security, Data Security
Xiangzheng Zhang
360
AI safety, Large language models, Information Retrieval
Tong Yang
Peking University, Beijing, China
Sketch, Network measurement, Bloom filter, IP lookup, Hash Table
Lin Sun
Qihoo 360
Large language models