🤖 AI Summary
Existing scalarization approaches in multi-reward reinforcement learning each have a drawback: Reward Combination often produces advantages with excessively large magnitudes, which destabilizes training, while Advantage Combination relies on static weights and ignores inter-objective correlations. This work proposes Dynamic Variance-adaptive Advantage Optimization (DVAO), which operates within the Group Relative Policy Optimization (GRPO) framework and dynamically adjusts combination weights based on the empirical reward variance of each objective within a rollout group. By up-weighting objectives that carry a stronger learning signal and suppressing noisy ones, together with a self-adaptive cross-objective regularization mechanism and provably bounded advantage magnitudes, the method targets stable optimization without requiring a value model. Evaluated on mathematical reasoning and tool-use tasks with Qwen3 and Qwen2.5 models, it is reported to significantly outperform baseline methods, achieving a superior Pareto front and improved training stability.
📝 Abstract
Reinforcement Learning has become a standard paradigm for aligning Large Language Models with human intent and task requirements. While Group Relative Policy Optimization offers an efficient, value-model-free alternative to Proximal Policy Optimization, adapting it to real-world multi-reward settings remains challenging. Standard scalarization practices, such as Reward Combination and Advantage Combination, suffer from significant drawbacks: Reward Combination frequently generates advantages with excessively large squared magnitudes that lead to training instability, while Advantage Combination relies on static hyperparameters and ignores cross-objective correlations. To address these limitations, we propose Dynamic Variance-adaptive Advantage Optimization (DVAO), which dynamically adjusts combination weights based on the empirical reward variance of each objective within a rollout group, effectively up-weighting objectives with a stronger learning signal while suppressing noisy ones. We mathematically prove that DVAO maintains bounded advantage magnitudes for stable training and introduces a self-adaptive cross-objective regularization mechanism. Extensive experiments on mathematical reasoning and tool-use benchmarks using Qwen3 and Qwen2.5 models demonstrate that DVAO significantly outperforms baseline methods, achieving a superior multi-objective Pareto frontier and robust training stability.
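The abstract describes the weighting idea only in words, so the following is a minimal sketch of one possible reading, not the paper's actual formula. It assumes that each objective's advantage is normalized within the rollout group (as in GRPO), and that the combination weight of an objective is proportional to its within-group reward variance, so an objective that does not distinguish the rollouts contributes nothing. The function names, the `EPS` constant, the uniform fallback, and the example rewards are all hypothetical.

```python
from statistics import fmean, pstdev

EPS = 1e-8  # guards against division by zero when a group has identical rewards


def group_advantages(rewards):
    """GRPO-style group-relative advantage for one objective: (r - mean) / std."""
    mu = fmean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + EPS) for r in rewards]


def variance_weights(rewards_per_objective):
    """Hypothetical weighting: proportional to each objective's within-group variance.

    Assumption: an objective whose rewards vary more across the group carries
    more learning signal. The source does not state the actual formula.
    """
    variances = [pstdev(r) ** 2 for r in rewards_per_objective]
    total = sum(variances)
    if total < EPS:
        # No objective distinguishes the rollouts; fall back to uniform weights.
        return [1.0 / len(variances)] * len(variances)
    return [v / total for v in variances]


def combine_advantages(rewards_per_objective):
    """Weighted sum of per-objective group-relative advantages."""
    weights = variance_weights(rewards_per_objective)
    per_obj = [group_advantages(r) for r in rewards_per_objective]
    n = len(rewards_per_objective[0])
    return [sum(w * a[i] for w, a in zip(weights, per_obj)) for i in range(n)]


# Four rollouts for one prompt, two objectives (values are made up for illustration).
correctness = [1.0, 0.0, 1.0, 0.0]  # varies across the group
format_ok = [1.0, 1.0, 1.0, 1.0]    # identical for every rollout, so no signal
weights = variance_weights([correctness, format_ok])
advantages = combine_advantages([correctness, format_ok])
```

Because the weights in this sketch are non-negative and sum to one, the combined advantage of a rollout can never exceed the largest per-objective advantage in magnitude. That is a property of this illustrative convex combination only; the paper's own boundedness proof and its cross-objective regularization term are not reproduced here.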