🤖 AI Summary
Existing RLHF methods struggle to model the multidimensionality, distributional nature, and fine-grained distinctions inherent in human preferences, and they suffer from high numerical sensitivity and computational overhead. This paper proposes a utility-function-based conditional alignment framework. It introduces a novel symbolic utility mapping mechanism that converts continuous rewards into robust symbolic tokens, and it combines a strictly increasing non-linear utility family with conditional fine-tuning. For the first time, this enables simultaneous distribution-level Pareto frontier optimization and cross-dimensional alignment stability within a single model. Experiments demonstrate that the method substantially mitigates numerical sensitivity (e.g., reliably distinguishing scores of 9.11 versus 9.8), reduces training cost several-fold, and surpasses state-of-the-art approaches (e.g., RiC, MORLHF) in both Pareto frontier quality and robustness across reward dimensions.
📝 Abstract
Reinforcement Learning from Human Feedback (RLHF) has become a cornerstone for aligning large language models (LLMs) with human values. However, existing approaches struggle to capture the multi-dimensional, distributional nuances of human preferences. Methods such as RiC that inject raw reward values directly into prompts face significant numerical-sensitivity issues (for instance, LLMs may fail to distinguish 9.11 from 9.8), while alternatives like MORLHF, Rewarded Soups, and MODPO incur high computational costs by training multiple models. In this work, we introduce Utility-Conditioned Multi-Objective Alignment (UC-MOA), a novel framework that overcomes these limitations. Our approach leverages a diverse set of strictly increasing, non-linear utility functions to transform user-specified preferences into symbolic tokens, which are then used to condition a single LLM. This design not only mitigates numerical reasoning challenges but also substantially reduces training overhead, yielding models that achieve superior Pareto fronts and robust alignment across complex reward dimensions.
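To make the conditioning idea concrete, the following is a minimal sketch of what a symbolic utility mapping could look like: each continuous reward is passed through a strictly increasing non-linear utility function and then discretized into a symbolic token prepended to the prompt, so the model never has to compare raw decimals like 9.11 versus 9.8. All function names, token formats, and bin counts here are illustrative assumptions, not the paper's actual implementation.

```python
import math

# Hypothetical family of strictly increasing, non-linear utilities.
# These are placeholders; the paper's actual utility family may differ.
UTILITIES = {
    "log": lambda r: math.log1p(max(r, 0.0)),      # concave: diminishing returns
    "sigmoid": lambda r: 1.0 / (1.0 + math.exp(-r)),  # bounded S-curve
}

def to_symbolic_token(reward, dim, utility="sigmoid", n_bins=5):
    """Map a continuous reward on dimension `dim` to a discrete symbolic token.

    The strictly increasing utility preserves the reward ordering, and
    binning replaces fragile decimal comparisons with robust tokens.
    """
    u = UTILITIES[utility](reward)
    u = max(0.0, min(1.0, u))  # clamp into [0, 1] before binning
    bin_idx = min(int(u * n_bins), n_bins - 1)
    return f"<{dim}_U{bin_idx}>"

def condition_prompt(prompt, rewards, utility="sigmoid"):
    """Prepend one symbolic utility token per reward dimension."""
    tokens = "".join(
        to_symbolic_token(r, d, utility) for d, r in sorted(rewards.items())
    )
    return tokens + " " + prompt
```

Because the utility is strictly increasing, higher rewards always map to equal-or-higher token indices, so a single conditioned model can be steered across reward dimensions at inference time simply by swapping the tokens.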