WorldPM: Scaling Human Preference Modeling

📅 2025-05-15
🤖 AI Summary
Human preference signals are fragmented across domains, lacking a unified representation and scalable characterization. Method: We propose World Preference Modeling (WorldPM), a framework for cross-domain preference representation learning, validated on 15M multi-source forum samples across 1.5B–72B parameter models. It introduces large-scale preference data construction, multi-scale training, three decoupled evaluation axes—adversarial, objective, and subjective—and integrates preference representation distillation with RLHF. Contribution/Results: We discover, for the first time, preference modeling’s language-model-like scaling law: adversarial and objective metrics scale consistently with model and data size, whereas subjective metrics do not. WorldPM achieves >5% average improvement across 20 subtasks on seven benchmarks, boosts internal RLHF pipelines by 4–8%, and significantly enhances generalization across sample sizes from 7K to 800K—establishing world preference as a transferable foundational model for preference learning.
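The summary describes training reward models on pairwise human preference data. The paper does not publish its training code here, but the standard objective for this setup is the Bradley-Terry pairwise loss: the model scores the chosen and rejected responses, and the loss is the negative log-probability that the chosen response wins. A minimal sketch of that loss (function names are illustrative, not from the paper):

```python
import math

def bt_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise preference loss.

    Given reward-model scores for the chosen and rejected responses,
    returns -log sigmoid(r_chosen - r_rejected): near zero when the
    chosen response is scored much higher, large when the ranking is
    inverted.
    """
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss rewards a larger margin between chosen and rejected scores:
# bt_loss(2.0, 0.0) is small, bt_loss(0.0, 0.0) = log 2, and
# bt_loss(0.0, 2.0) is large.
```

In practice `r_chosen` and `r_rejected` would be the scalar outputs of the 1.5B–72B reward models on a response pair, and the loss would be averaged over a batch; the scalar form above just makes the objective explicit.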

📝 Abstract
Motivated by scaling laws in language modeling that demonstrate how test loss scales as a power law with model and dataset sizes, we find that similar laws exist in preference modeling. We propose World Preference Modeling (WorldPM) to emphasize this scaling potential, where World Preference embodies a unified representation of human preferences. In this paper, we collect preference data from public forums covering diverse user communities, and conduct extensive training using 15M-scale data across models ranging from 1.5B to 72B parameters. We observe distinct patterns across different evaluation metrics: (1) Adversarial metrics (ability to identify deceptive features) consistently scale up with increased training data and base model size; (2) Objective metrics (objective knowledge with well-defined answers) show emergent behavior in larger language models, highlighting WorldPM's scalability potential; (3) Subjective metrics (subjective preferences from a limited number of humans or AI) do not demonstrate scaling trends. Further experiments validate the effectiveness of WorldPM as a foundation for preference fine-tuning. Through evaluations on 7 benchmarks with 20 subtasks, we find that WorldPM broadly improves the generalization performance across human preference datasets of varying sizes (7K, 100K and 800K samples), with performance gains exceeding 5% on many key subtasks. Integrating WorldPM into our internal RLHF pipeline, we observe significant improvements on both in-house and public evaluation sets, with notable gains of 4% to 8% in our in-house evaluations.
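The abstract's central claim is that test loss follows a power law in model and dataset size, L(N) ≈ a · N^(−α). Such a law is typically verified by fitting a straight line in log-log space, since log L = log a − α · log N. A small self-contained sketch of that fit (synthetic data; names and values are illustrative, not the paper's measurements):

```python
import math

def fit_power_law(sizes, losses):
    """Fit L(N) ~ a * N**(-alpha) by ordinary least squares in log-log space.

    A power law is linear after taking logs: log L = log a - alpha * log N,
    so the fitted slope gives -alpha and the intercept gives log a.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(l) for l in losses]
    k = len(xs)
    xbar, ybar = sum(xs) / k, sum(ys) / k
    slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) \
        / sum((x - xbar) ** 2 for x in xs)
    alpha = -slope
    a = math.exp(ybar - slope * xbar)
    return a, alpha

# Synthetic check: data generated from an exact power law is recovered.
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [2.0 * n ** -0.5 for n in sizes]
a, alpha = fit_power_law(sizes, losses)
# a ≈ 2.0, alpha ≈ 0.5
```

On real measurements the points scatter around the line rather than lying on it; the paper's finding is that adversarial and objective preference metrics follow such a trend while subjective metrics do not.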
Problem

Research questions and friction points this paper is trying to address.

Scaling human preference modeling like language models
Evaluating adversarial, objective, subjective preference metrics
Improving generalization in human preference datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scaling preference modeling with power laws
Training on 15M samples across 1.5B–72B parameter models
Improving RLHF pipeline performance by 4-8%
Authors

Binghai Wang
Qwen Team, Alibaba Group

Runji Lin
Institute of Automation, Chinese Academy of Sciences
Reinforcement Learning, Multi-Agent System, Large Language Model

Keming Lu
Qwen Team, Alibaba Group

Le Yu
Qwen Team, Alibaba Group

Zhenru Zhang
Qwen Team, Alibaba Group
Large Language Model

Fei Huang
Qwen Team, Alibaba Group

Chujie Zheng
Qwen Team, Alibaba Group
Artificial Intelligence, Large Language Models

Kai Dang
Qwen Team, Alibaba Group

Yang Fan
University of Science and Technology of China
Learning to Teach, Automated Machine Learning, Neural Architecture Search, Natural Language Processing, AI for Medicine

Xingzhang Ren
Qwen Team, Alibaba Group

An Yang
Qwen Team, Peking University
Natural Language Processing (NLP)

Binyuan Hui
Qwen Team, Alibaba Group
Large Language Models, CodeLLMs, Reasoning, Agent

Dayiheng Liu
Qwen Team, Alibaba Group

Tao Gui
Institute of Trustworthy Embodied Artificial Intelligence, Fudan University; School of Computer Science, Fudan University

Qi Zhang
Institute of Trustworthy Embodied Artificial Intelligence, Fudan University; School of Computer Science, Fudan University

Xuanjing Huang
Institute of Trustworthy Embodied Artificial Intelligence, Fudan University; School of Computer Science, Fudan University

Yu-Gang Jiang
Professor, Fudan University. IEEE & IAPR Fellow
Video Analysis, Embodied AI, Trustworthy AI

Bowen Yu
Qwen Team, Alibaba Group
Post-training, Foundation Model

Jingren Zhou
Alibaba Group, Microsoft
Cloud Computing, Large Scale Distributed Systems, Machine Learning, Query Processing

Junyang Lin
Qwen Team, Alibaba Group & Peking University
Natural Language Processing, Cross-Modal Representation Learning, Pretraining