Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the zero-sum trade-off between competing human preferences, such as safety and helpfulness, in multi-objective alignment of large language models. Conventional approaches (data selection, parameter merging, and algorithmic balancing during training) only trade one preference against another along a fixed Pareto frontier. The study identifies the prompt itself as the root cause: a given prompt inherently restricts which multi-dimensional rewards are achievable. To overcome this, the authors propose MORA (Multi-Objective Reward Assimilation), which isolates single-reward prompts through pre-sampling and rewrites the original questions to incorporate multi-dimensional intents, expanding their reward diversity. The underlying analysis scales up model rollouts and scores the outputs across reward dimensions. Experiments report single-preference improvements of 5% to 12.4%, with the largest gains in harmlessness, in sequential alignment across helpful, harmless, and truthful dimensions, and a 4.6% average gain in overall reward under simultaneous alignment, which the authors present as moving beyond the fixed Pareto frontier.
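To make the notion of a Pareto frontier concrete, the following is a minimal sketch, not taken from the paper, of how one might pick out the non-dominated reward vectors among a set of sampled responses. The function names `dominates` and `pareto_front` and the two-dimensional example scores are illustrative assumptions.

```python
from typing import List, Sequence, Tuple

RewardVector = Tuple[float, ...]  # e.g. (helpfulness, harmlessness); hypothetical scores


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if `a` is at least as good as `b` everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points: List[RewardVector]) -> List[RewardVector]:
    """Keep every reward vector that no other vector dominates (higher is better)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Illustrative, made-up scores for four responses to one prompt.
scores = [(0.9, 0.1), (0.5, 0.5), (0.1, 0.9), (0.4, 0.4)]
front = pareto_front(scores)
```

In this toy example the point `(0.4, 0.4)` is dominated by `(0.5, 0.5)`, while the other three trade one dimension against the other. A fixed frontier of this kind is what the summary describes prior methods as moving along.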

📝 Abstract

In the realm of multi-objective alignment for large language models, balancing disparate human preferences often manifests as a zero-sum conflict. Specifically, the intrinsic tension between competing goals dictates that aggressively optimizing for one metric (e.g., helpfulness) frequently incurs a substantial penalty on another (e.g., harmlessness). While prior work mainly focuses on data selection, parameter merging, or algorithmic balancing during training, these approaches merely force compromises between divergent preferences along a fixed Pareto frontier and fail to resolve the underlying trade-off. In this work, we approach the problem from the perspective of multi-dimensional rewards. By scaling up the model's rollouts and analyzing the outputs across different reward dimensions, we arrive at a critical conclusion: the conflict among multiple objectives stems from the fact that the prompt itself inherently restricts the achievable multi-dimensional rewards. Based on this core observation, we propose MORA: Multi-Objective Reward Assimilation. Specifically, MORA isolates single-reward prompts through pre-sampling and expands their reward diversity by rewriting the original questions to incorporate multi-dimensional intents. Extensive experiments demonstrate that (1) in sequential alignment, MORA achieves single-preference improvements ranging from 5% to 12.4%, with exceptional gains in harmlessness, after multiple-preference alignment across helpful, harmless, and truthful dimensions; and (2) in simultaneous alignment, MORA achieves an average overall reward improvement of 4.6%. Our code is available at https://anonymous.4open.science/r/MORA-MPA.
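The pre-sampling step can be illustrated with a minimal sketch. The abstract does not define how a "single-reward prompt" is detected, so the criterion below is an assumption: a prompt is flagged when no single sampled rollout clears a threshold on more than one reward dimension. The names `DIMENSIONS`, `max_jointly_satisfied`, `isolate_single_reward_prompts`, and the threshold of 0.5 are hypothetical and are not taken from the MORA code.

```python
from typing import Dict, List

# Reward axes named in the abstract; the scoring scale in [0, 1] is an assumption.
DIMENSIONS = ("helpful", "harmless", "truthful")

Rewards = Dict[str, float]


def max_jointly_satisfied(rollouts: List[Rewards], threshold: float = 0.5) -> int:
    """Largest number of dimensions that any single rollout satisfies at once."""
    return max(
        sum(1 for dim in DIMENSIONS if rewards[dim] >= threshold)
        for rewards in rollouts
    )


def isolate_single_reward_prompts(
    scored: Dict[str, List[Rewards]], threshold: float = 0.5
) -> List[str]:
    """Return prompts whose rollouts never score well on two or more dimensions together."""
    return [
        prompt
        for prompt, rollouts in scored.items()
        if rollouts and max_jointly_satisfied(rollouts, threshold) <= 1
    ]


# Made-up scores: prompt "a" yields helpful OR harmless rollouts, never both.
scored = {
    "a": [
        {"helpful": 0.9, "harmless": 0.1, "truthful": 0.2},
        {"helpful": 0.2, "harmless": 0.8, "truthful": 0.1},
    ],
    "b": [{"helpful": 0.9, "harmless": 0.8, "truthful": 0.7}],
}
flagged = isolate_single_reward_prompts(scored)
```

Under this assumed criterion, prompt "a" would be a candidate for rewriting, since no amount of sampling produced a response that is rewarded on several dimensions at once, whereas prompt "b" already admits such a response.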

Problem

Research questions and friction points this paper is trying to address.

multi-objective alignment

preference trade-off

safety-helpfulness conflict

Pareto frontier

reward dimensionality

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-objective alignment

preference dimensional expansion

reward diversity