AI Summary
This work proposes TriPlay-RL, a three-agent closed-loop reinforcement learning framework that mitigates the risk of large language models generating toxic or harmful content. By orchestrating a self-play process among an attacker, a defender, and an evaluator, the approach achieves continuous safety alignment with minimal human annotation. The framework unifies adversarial prompt generation, defensive response generation, and fine-grained safety evaluation within a co-evolutionary learning loop. Experimental results show that the attacker improves adversarial effectiveness by 20%–50% while maintaining output diversity; the defender improves safety performance by 10%–30% without compromising reasoning capability; and the evaluator steadily increases its accuracy in distinguishing unsafe responses, simple refusals, and helpful guidance.
Abstract
In recent years, safety risks associated with large language models have become increasingly prominent, highlighting the urgent need to mitigate the generation of toxic and harmful content. The mainstream paradigm for LLM safety alignment typically adopts a collaborative framework involving three roles: an attacker that generates adversarial prompts, a defender that produces safe responses, and an evaluator that assesses those responses. In this paper, we propose TriPlay-RL, a closed-loop reinforcement learning framework that enables iterative, co-evolving collaboration among the three roles with near-zero manual annotation. Experimental results show that the attacker preserves high output diversity while achieving a 20%–50% improvement in adversarial effectiveness; the defender attains 10%–30% gains in safety performance without degrading general reasoning capability; and the evaluator continuously refines its fine-grained judgment across iterations, accurately distinguishing unsafe responses, simple refusals, and useful guidance. Overall, our framework establishes an efficient and scalable paradigm for LLM safety alignment, enabling continuous co-evolution within a unified learning loop.
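The closed loop described above can be sketched as a toy self-play simulation. This is a minimal illustration of the attacker/defender/evaluator round structure, not the paper's actual implementation: all class names, the scalar "strength"/"safety" parameters, and the learn-from-losses update rule (a crude stand-in for a policy-gradient step) are hypothetical.

```python
# Toy sketch of a three-role closed-loop self-play round, loosely following
# the attacker -> defender -> evaluator structure described in the abstract.
# Everything here is an illustrative assumption, not the paper's method.
import random


class Attacker:
    """Generates adversarial 'attacks'; adapts after rounds it loses."""
    def __init__(self):
        self.strength = 0.3  # toy proxy for adversarial effectiveness

    def generate(self, rng):
        # Severity of the attack depends on current strength plus noise.
        return {"severity": self.strength + rng.random() * 0.2}

    def learn_from_loss(self):
        self.strength = min(1.0, self.strength + 0.05)


class Defender:
    """Answers attacks; adapts after rounds where it produced an unsafe response."""
    def __init__(self):
        self.safety = 0.5  # toy proxy for safety performance

    def respond(self, attack):
        # Response is 'safe' if the defender's safety exceeds attack severity.
        return {"safe": self.safety >= attack["severity"]}

    def learn_from_loss(self):
        self.safety = min(1.0, self.safety + 0.05)


class Evaluator:
    """Judges each round; here a fixed oracle standing in for a learned judge."""
    @staticmethod
    def attacker_won(response):
        return not response["safe"]


def co_evolve(steps=50, seed=0):
    rng = random.Random(seed)
    atk, dfn, evl = Attacker(), Defender(), Evaluator()
    for _ in range(steps):
        attack = atk.generate(rng)
        response = dfn.respond(attack)
        # Whichever side lost this round updates its parameters,
        # so the two roles drive each other upward over iterations.
        if evl.attacker_won(response):
            dfn.learn_from_loss()
        else:
            atk.learn_from_loss()
    return atk.strength, dfn.safety


if __name__ == "__main__":
    strength, safety = co_evolve()
    print(f"attacker strength={strength:.2f}, defender safety={safety:.2f}")
```

Running the loop shows both scalar proxies rising from their initial values, mimicking the co-evolution the abstract reports; in the real framework these updates would be RL fine-tuning steps on the three models.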