The Alignment Waltz: Jointly Training Agents to Collaborate for Safety

📅 2025-10-09
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Large language models (LLMs) face a safety-alignment dilemma: they are vulnerable to adversarial attacks that elicit harmful outputs, yet frequently over-reject benign but sensitive queries. Existing approaches rely on rigid, all-or-nothing rejection policies that lack fine-grained response guidance. This work proposes WaltzRL, a collaborative multi-agent reinforcement learning framework that formalizes safety alignment as a positive-sum game, introducing a Dynamic Improvement Reward (DIR) and an adaptive feedback mechanism. Within this framework, a conversation agent and a feedback agent are jointly trained and collaborate at inference time, enabling real-time response refinement rather than mere rejection. Experiments across five benchmark datasets demonstrate substantial improvements: the unsafe response rate drops from 39.0% to 4.6% (WildJailbreak), and the over-rejection rate declines from 45.3% to 9.9% (OR-Bench), significantly improving the trade-off between safety and usability.

๐Ÿ“ Abstract
Harnessing the power of LLMs requires a delicate dance between being helpful and harmless. This creates a fundamental tension between two competing challenges: vulnerability to adversarial attacks that elicit unsafe content, and a tendency for overrefusal on benign but sensitive prompts. Current approaches often navigate this dance with safeguard models that completely reject any content that contains unsafe portions. This approach cuts the music entirely: it may exacerbate overrefusals and fails to provide nuanced guidance for queries it refuses. To teach models a more coordinated choreography, we propose WaltzRL, a novel multi-agent reinforcement learning framework that formulates safety alignment as a collaborative, positive-sum game. WaltzRL jointly trains a conversation agent and a feedback agent, where the latter is incentivized to provide useful suggestions that improve the safety and helpfulness of the conversation agent's responses. At the core of WaltzRL is a Dynamic Improvement Reward (DIR) that evolves over time based on how well the conversation agent incorporates the feedback. At inference time, unsafe or overrefusing responses from the conversation agent are improved rather than discarded. The feedback agent is deployed together with the conversation agent and only engages adaptively when needed, preserving helpfulness and low latency on safe queries. Our experiments, conducted across five diverse datasets, demonstrate that WaltzRL significantly reduces both unsafe responses (e.g., from 39.0% to 4.6% on WildJailbreak) and overrefusals (from 45.3% to 9.9% on OR-Bench) compared to various baselines. By enabling the conversation and feedback agents to co-evolve and adaptively apply feedback, WaltzRL enhances LLM safety without degrading general capabilities, thereby advancing the Pareto front between helpfulness and harmlessness.
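The adaptive inference-time loop described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: `generate`, `critique`, and `revise` are hypothetical placeholders standing in for the conversation agent, the feedback agent, and the feedback-incorporation step.

```python
# Minimal sketch of WaltzRL-style adaptive feedback at inference time.
# All functions below are hypothetical stand-ins, not the paper's code.

def generate(prompt):
    # Placeholder conversation agent: returns a canned response.
    return f"response to: {prompt}"

def critique(prompt, response):
    # Placeholder feedback agent: returns (needs_feedback, suggestion).
    # In WaltzRL the feedback agent engages only when a response is unsafe
    # or an overrefusal, keeping latency low on safe queries.
    if "harmful" in prompt:
        return True, "decline, but explain why and offer a safe alternative"
    return False, None

def revise(prompt, response, suggestion):
    # Placeholder revision step: the conversation agent incorporates feedback.
    return f"revised ({suggestion}): {response}"

def respond(prompt, max_rounds=2):
    response = generate(prompt)
    for _ in range(max_rounds):
        needs_feedback, suggestion = critique(prompt, response)
        if not needs_feedback:
            break  # feedback agent stays silent on safe, helpful responses
        response = revise(prompt, response, suggestion)
    return response

print(respond("harmful request"))   # unsafe draft gets revised, not discarded
print(respond("benign question"))   # safe query passes through untouched
```

The key property this sketch captures is that problematic responses are improved in place rather than replaced with a blanket refusal, while safe queries skip the feedback round entirely.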
Problem

Research questions and friction points this paper is trying to address.

Addresses tension between helpfulness and harmlessness in LLMs
Reduces both unsafe responses and excessive refusal rates
Enables collaborative safety alignment through multi-agent training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent reinforcement learning framework for safety alignment
Dynamic Improvement Reward evolving based on feedback incorporation
Feedback agent adaptively engaged to improve unsafe responses
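The improvement term behind the Dynamic Improvement Reward can be illustrated with a short sketch. The scoring scale and this exact form are assumptions for illustration; in the paper the reward additionally evolves over training based on how well the conversation agent incorporates feedback.

```python
# Illustrative sketch of the improvement signal behind DIR (assumed form,
# not the paper's exact formulation).

def dir_improvement(score_before, score_after):
    # Reward the feedback agent in proportion to how much its suggestion
    # improved a combined safety/helpfulness score of the response.
    return score_after - score_before

# Feedback turns an unsafe draft (0.2) into a safe, helpful revision (0.9),
# earning a positive reward; feedback that makes things worse is penalized.
print(dir_improvement(0.2, 0.9))
print(dir_improvement(0.5, 0.3))
```

Tying the feedback agent's reward to the conversation agent's actual improvement is what makes the game positive-sum: the feedback agent only wins when its suggestions genuinely help.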