🤖 AI Summary
This work addresses the instability of scalar advantage estimation in multi-task and mixed-reward reinforcement learning, which arises from heterogeneous reward distributions and correlated reward dimensions. To tackle both problems, the paper proposes Reward-Decorrelated Policy Optimization (RDPO). RDPO first stabilizes prompt-level advantage allocation across binary, fractional, and continuous rewards via Magnitude-Aware Quantile normalization, then mitigates correlation redundancy by applying Mahalanobis whitening within each active reward subspace before the rewards are aggregated. Applied during post-training of the LongCat-Flash model, RDPO improves instruction following, writing quality, and robustness to hard prompts, while remaining broadly competitive on reasoning and coding evaluations.
📝 Abstract
Complex reinforcement learning environments frequently employ multi-task and mixed-reward formulations. In these settings, heterogeneous reward distributions and correlated reward dimensions often destabilize the construction of scalar advantages. To address these challenges, we propose Reward-Decorrelated Policy Optimization (RDPO), a reward-processing method designed to explicitly target both failure modes. RDPO first utilizes Magnitude-Aware Quantile normalization to stabilize prompt-level advantage allocation across binary, fractional, and continuous rewards. It then applies Mahalanobis whitening within each active reward subspace to mitigate correlation redundancy prior to aggregation. When applied during the post-training of LongCat-Flash, RDPO enhances instruction following, writing quality, and robustness to hard prompts while remaining broadly competitive on reasoning and coding evaluations.
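The two-stage pipeline described above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation, and it rests on several assumptions: the "magnitude-aware" component is not specified in the abstract, so a plain mid-rank quantile transform is used in its place; a reward dimension is treated as "active" when it has non-zero variance within the prompt group; the whitening matrix is the symmetric inverse square root of the covariance; and the final aggregation is an unweighted sum. The function names `quantile_normalize`, `whiten_active`, and `rdpo_style_advantages` are hypothetical.

```python
import numpy as np


def quantile_normalize(rewards):
    """Map one reward dimension to centered mid-rank quantiles.

    Assumption: plain quantile normalization; the paper's
    magnitude-aware variant is not reproduced here.
    """
    r = np.asarray(rewards, dtype=float)
    # Count, for each sample, how many samples are strictly smaller / equal.
    less = (r[None, :] < r[:, None]).sum(axis=1)
    equal = (r[None, :] == r[:, None]).sum(axis=1)
    q = (less + 0.5 * equal) / r.size  # ties share the same quantile
    return q - q.mean()  # zero-mean within the group


def whiten_active(A, eps=1e-6):
    """Mahalanobis (symmetric inverse-square-root) whitening.

    Only columns with non-zero variance (assumed "active") are
    whitened; inactive columns are left at zero.
    """
    A = np.asarray(A, dtype=float)
    out = np.zeros_like(A)
    active = A.std(axis=0) > 1e-12
    if not active.any():
        return out
    X = A[:, active] - A[:, active].mean(axis=0)
    cov = X.T @ X / X.shape[0]
    vals, vecs = np.linalg.eigh(cov)
    # eps regularizes near-singular covariances (hypothetical choice).
    inv_sqrt = vecs @ np.diag(1.0 / np.sqrt(np.clip(vals, 0.0, None) + eps)) @ vecs.T
    out[:, active] = X @ inv_sqrt
    return out


def rdpo_style_advantages(R):
    """Rows are sampled responses for one prompt, columns are reward types."""
    R = np.asarray(R, dtype=float)
    Z = np.column_stack([quantile_normalize(R[:, j]) for j in range(R.shape[1])])
    # Assumption: unweighted sum over decorrelated dimensions.
    return whiten_active(Z).sum(axis=1)
```

In this sketch, a group of eight responses scored on a binary reward, a continuous reward, and a reward that is constant across the group would yield whitened binary and continuous columns with approximately identity covariance, while the constant column contributes nothing to the aggregated advantage.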