🤖 AI Summary
To address performance bottlenecks in offline multi-objective alignment of large language models—specifically, inadequate preference representation and imbalanced reward scoring—this paper proposes a novel data filtering paradigm grounded in preference direction modeling and Pareto frontier guidance. Methodologically: (1) human preferences are explicitly encoded as unit direction vectors in the objective space; (2) a two-stage mechanism is introduced—first identifying the neighborhood of the Pareto frontier, then dynamically sampling high-quality samples aligned with the target direction; (3) an end-to-end offline multi-objective alignment framework is developed, enabling customizable alignment behavior. Experiments demonstrate that the proposed method significantly outperforms five baselines across two multi-objective alignment tasks, achieving simultaneous improvements in alignment quality, training efficiency, and objective diversity. To the best of our knowledge, this is the first approach to realize direction-controllable, data-adaptive offline multi-objective alignment.
📝 Abstract
Aligning large language models with multiple human expectations and values is crucial for ensuring that they adequately serve a variety of user needs. To this end, offline multi-objective alignment algorithms such as the Rewards-in-Context algorithm have shown strong performance and efficiency. However, inappropriate preference representations and training with imbalanced reward scores limit the performance of such algorithms. In this work, we introduce ParetoHqD, which addresses the above issues by representing human preferences as preference directions in the objective space and regarding data near the Pareto front as "high-quality" data. For each preference, ParetoHqD follows a two-stage supervised fine-tuning process, where each stage uses an individual Pareto high-quality training set that best matches its preference direction. Experimental results demonstrate the superiority of ParetoHqD over five baselines on two multi-objective alignment tasks.
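The core filtering idea—keep samples near the Pareto front of the multi-reward objective space, then rank them by how well their reward vector aligns with a preference direction—can be illustrated with a minimal sketch. This is not the paper's exact algorithm; the dominance tolerance `tol`, the min-max normalization, and the cosine-similarity ranking are illustrative assumptions for the sketch.

```python
import numpy as np

def near_pareto_mask(rewards: np.ndarray, tol: float = 0.05) -> np.ndarray:
    """Boolean mask of samples in the neighborhood of the Pareto front.

    rewards: (n, m) array of reward scores, higher is better on every objective.
    A sample is dropped only if some other sample beats it by at least `tol`
    on every objective (an illustrative notion of "near the front").
    """
    n = len(rewards)
    mask = np.ones(n, dtype=bool)
    for i in range(n):
        others = np.delete(rewards, i, axis=0)
        dominated = np.any(np.all(others >= rewards[i] + tol, axis=1))
        mask[i] = not dominated
    return mask

def select_for_preference(rewards: np.ndarray, preference: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k near-Pareto samples best matching a preference direction."""
    d = preference / np.linalg.norm(preference)          # unit preference direction
    lo, hi = rewards.min(axis=0), rewards.max(axis=0)
    z = (rewards - lo) / np.where(hi > lo, hi - lo, 1.0)  # min-max normalize per objective
    cos = z @ d / np.maximum(np.linalg.norm(z, axis=1), 1e-12)
    idx = np.where(near_pareto_mask(rewards))[0]          # restrict to the front neighborhood
    return idx[np.argsort(-cos[idx])][:k]                 # best direction match first
```

For example, with two objectives and preference direction (1, 1), a balanced sample such as (0.5, 0.5) on the front outranks the extreme points (1, 0) and (0, 1), while a dominated point like (0.1, 0.1) is filtered out before ranking.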