🤖 AI Summary
Existing RLHF methods struggle to model the multidimensionality, distributional nature, and fine-grained distinctions inherent in human preferences, and they suffer from high numerical sensitivity and computational overhead. This paper proposes a utility-function-based conditional alignment framework. It introduces a novel symbolic utility mapping mechanism that converts continuous rewards into robust symbolic tokens, and it combines a strictly increasing non-linear utility family with conditional fine-tuning. For the first time, this enables simultaneous distribution-level Pareto frontier optimization and cross-dimensional alignment stability within a single model. Experiments demonstrate that the method substantially mitigates numerical sensitivity (e.g., reliably distinguishing scores of 9.11 versus 9.8), reduces training cost several-fold, and surpasses state-of-the-art approaches (e.g., RiC, MORLHF) in both Pareto frontier quality and robustness across reward dimensions.
📝 Abstract
Reinforcement Learning from Human Feedback (RLHF) has become a cornerstone for aligning large language models (LLMs) with human values. However, existing approaches struggle to capture the multi-dimensional, distributional nuances of human preferences. Methods such as RiC that inject raw reward values directly into prompts face significant numerical-sensitivity issues (for instance, LLMs may fail to distinguish 9.11 from 9.8), while alternatives like MORLHF, Rewarded Soups, and MODPO incur high computational costs by training multiple models. In this work, we introduce Utility-Conditioned Multi-Objective Alignment (UC-MOA), a novel framework that overcomes these limitations. Our approach leverages a diverse set of strictly increasing, non-linear utility functions to transform user-specified preferences into symbolic tokens, which are then used to condition a single LLM. This design not only mitigates numerical reasoning challenges but also substantially reduces training overhead, yielding models that achieve superior Pareto fronts and robust alignment across complex reward dimensions.
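To make the conditioning idea concrete, the following is a minimal sketch of what a symbolic utility mapping could look like: each continuous reward is passed through a strictly increasing non-linear utility function and then discretized into a symbolic token prepended to the prompt, so the model never has to compare raw decimals like 9.11 versus 9.8. All function names, token formats, and bin counts here are illustrative assumptions, not the paper's actual implementation.

```python
import math

# Hypothetical family of strictly increasing, non-linear utilities.
# These are placeholders; the paper's actual utility family may differ.
UTILITIES = {
    "log": lambda r: math.log1p(max(r, 0.0)),      # concave: diminishing returns
    "sigmoid": lambda r: 1.0 / (1.0 + math.exp(-r)),  # bounded S-curve
}

def to_symbolic_token(reward, dim, utility="sigmoid", n_bins=5):
    """Map a continuous reward on dimension `dim` to a discrete symbolic token.

    The strictly increasing utility preserves the reward ordering, and
    binning replaces fragile decimal comparisons with robust tokens.
    """
    u = UTILITIES[utility](reward)
    u = max(0.0, min(1.0, u))  # clamp into [0, 1] before binning
    bin_idx = min(int(u * n_bins), n_bins - 1)
    return f"<{dim}_U{bin_idx}>"

def condition_prompt(prompt, rewards, utility="sigmoid"):
    """Prepend one symbolic utility token per reward dimension."""
    tokens = "".join(
        to_symbolic_token(r, d, utility) for d, r in sorted(rewards.items())
    )
    return tokens + " " + prompt
```

Because the utility is strictly increasing, higher rewards always map to equal-or-higher token indices, so a single conditioned model can be steered across reward dimensions at inference time simply by swapping the tokens.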