Reward-free Alignment for Conflicting Objectives

📅 2026-02-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the instability and suboptimal trade-offs that arise when aligning large language models with multiple, often conflicting objectives. To overcome these challenges, the authors propose RACO (Reward-free Alignment for Conflicted Objectives), a framework that directly leverages pairwise preference data without requiring explicit reward models. RACO employs a clipped variant of conflict-averse gradient descent to harmonize optimization directions across objectives. The approach is theoretically guaranteed to converge to a weighted Pareto-stationary point, and its gradient clipping mechanism provably accelerates convergence in the two-objective setting. Experiments on summarization and safety alignment tasks with models such as Qwen 3, Llama 3, and Gemma 3 demonstrate that RACO consistently outperforms existing baselines, achieving superior Pareto trade-offs among competing objectives.

📝 Abstract
Direct alignment methods are increasingly used to align large language models (LLMs) with human preferences. However, many real-world alignment problems involve multiple conflicting objectives, where naive aggregation of preferences can lead to unstable training and poor trade-offs. In particular, weighted-loss methods may fail to identify update directions that simultaneously improve all objectives, and existing multi-objective approaches often rely on explicit reward models, introducing additional complexity and distorting user-specified preferences. The contributions of this paper are two-fold. First, we propose a Reward-free Alignment framework for Conflicted Objectives (RACO) that directly leverages pairwise preference data and resolves gradient conflicts via a novel clipped variant of conflict-averse gradient descent. We provide convergence guarantees to Pareto-critical points that respect user-specified objective weights, and further show that clipping can strictly improve the convergence rate in the two-objective setting. Second, we augment our method with practical heuristics and conduct experiments demonstrating the compatibility of the proposed framework with LLM alignment. Both qualitative and quantitative evaluations on multi-objective summarization and safety alignment tasks across multiple LLM families (Qwen 3, Llama 3, Gemma 3) show that our method consistently achieves better Pareto trade-offs than existing multi-objective alignment baselines.
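The core idea described in the abstract — combining per-objective gradients while avoiding directions that conflict, then clipping the resulting update — can be illustrated with a toy sketch. This is a hypothetical illustration of the general mechanism (a PCGrad-style conflict projection plus norm clipping), not the authors' actual RACO algorithm; the function name, the projection rule, and the weighting scheme are all assumptions for illustration only.

```python
import numpy as np

def clipped_conflict_averse_step(grads, weights, clip_norm):
    """Toy sketch: combine per-objective gradients, resolve conflicts,
    and clip the update norm. Not the RACO algorithm itself."""
    # Start from the user-weighted average of the objective gradients.
    d = sum(w * g for w, g in zip(weights, grads))
    # If the combined direction conflicts with any single objective's
    # gradient (negative inner product), project out the conflicting
    # component so no objective is locally made worse.
    for g in grads:
        dot = d @ g
        if dot < 0:
            d = d - (dot / (g @ g)) * g
    # Clip the final update to a maximum norm, which in the paper's
    # setting is argued to speed up convergence.
    n = np.linalg.norm(d)
    if n > clip_norm:
        d = d * (clip_norm / n)
    return d

# Two deliberately conflicting objective gradients.
grads = [np.array([1.0, 0.1]), np.array([-1.0, 0.2])]
d = clipped_conflict_averse_step(grads, weights=[0.7, 0.3], clip_norm=1.0)
```

The resulting direction `d` has a non-negative inner product with every objective gradient and a norm bounded by `clip_norm`, which is the qualitative behavior the abstract attributes to the clipped conflict-averse update.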
Problem

Research questions and friction points this paper is trying to address.

reward-free alignment
conflicting objectives
multi-objective alignment
preference aggregation
Pareto trade-offs
Innovation

Methods, ideas, or system contributions that make the work stand out.

reward-free alignment
conflicting objectives
Pareto optimization
gradient clipping
preference-based learning