SPARD: Self-Paced Curriculum for RL Alignment via Integrating Reward Dynamics and Data Utility

πŸ“… 2026-04-09
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the limitations of conventional reinforcement learning alignment methods, which rely on fixed reward weights and therefore struggle with non-stationary learning dynamics and cross-dimensional data heterogeneity in multi-objective settings. To overcome these challenges, the study introduces self-paced curriculum learning into multi-objective reinforcement learning alignment for the first time, proposing an end-to-end dynamic framework that jointly adapts reward dynamics and data utility: by sensing learning progress, it co-optimizes reward weights and data importance. Experimental results show that the proposed method significantly improves model performance across multiple benchmark tasks, confirming its effectiveness and generalization capability.
πŸ“ Abstract
The evolution of Large Language Models (LLMs) is shifting the focus from single, verifiable tasks toward complex, open-ended real-world scenarios, imposing significant challenges on the post-training phase. In these settings, the scale and complexity of reward systems have grown significantly, transitioning toward multi-objective formulations that encompass a comprehensive spectrum of model capabilities and application contexts. However, traditional methods typically rely on fixed reward weights, ignoring non-stationary learning dynamics and struggling with data heterogeneity across dimensions. To address these issues, we propose SPARD, a framework that establishes an automated, self-paced curriculum by perceiving learning progress to dynamically adjust multi-objective reward weights and data importance, thereby synchronizing learning intent with data utility for optimal performance. Extensive experiments across multiple benchmarks demonstrate that SPARD significantly enhances model capabilities across all domains.
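The abstract's core mechanism, perceiving learning progress and dynamically reweighting multiple reward objectives, can be illustrated with a minimal sketch. This is not the paper's actual formulation: the softmax weighting, the per-objective progress estimates in [0, 1], the `temperature` parameter, and all function names here are assumptions introduced purely for illustration.

```python
import math

def self_paced_weights(progress, temperature=1.0):
    """Hypothetical sketch: upweight objectives whose learning
    progress (each value in [0, 1]) lags behind the others."""
    # Lower progress -> larger logit -> larger softmax weight.
    logits = [(1.0 - p) / temperature for p in progress]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def scalarized_reward(rewards, weights):
    """Combine per-objective rewards using the dynamic weights."""
    return sum(w * r for w, r in zip(weights, rewards))

# Example: one objective nearly learned (0.9), one lagging (0.3).
w = self_paced_weights([0.9, 0.3, 0.6])
# The lagging objective receives the largest weight; weights sum to 1.
```

In a sketch like this, `temperature` would control how aggressively the curriculum concentrates training signal on lagging objectives; recomputing the weights each training step is what makes the schedule self-paced rather than fixed.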
Problem

Research questions and friction points this paper is trying to address.

reward dynamics
data heterogeneity
multi-objective reinforcement learning
self-paced curriculum
RL alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-Paced Curriculum
Reward Dynamics
Data Utility
Multi-Objective Reinforcement Learning
LLM Alignment
Authors

Xuyang Zhi
University of Science and Technology of China
Peilun Zhou
Xiaohongshu Inc.
Chengqiang Lu
University of Science and Technology of China
Hang Lv
University of Science and Technology of China
Yiwei Liang
University of Science and Technology of China
Rongyang Zhang
University of Science and Technology of China
Yan Gao
Xiaohongshu Inc.
Yi Wu
Xiaohongshu Inc.
Yao Hu
Zhejiang University
Hongchao Gu
University of Science and Technology of China
Hao Wang
University of Science and Technology of China
Defu Lian
University of Science and Technology of China
Enhong Chen
University of Science and Technology of China