ADORA: Training Reasoning Models with Dynamic Advantage Estimation on Reinforcement Learning

📅 2026-02-10
📈 Citations: 2
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the inefficiency of static advantage functions in traditional reinforcement learning, which struggle to adapt to the dynamic utility of training samples, leading to suboptimal credit assignment, unstable convergence, and degraded policy updates. To overcome this limitation, the authors propose ADORA, a novel framework that introduces a dynamic advantage estimation mechanism. ADORA performs online rollouts to continuously evaluate sample utility, adaptively partitions data into favorable and unfavorable categories, and dynamically reweights the advantage function to refine policy gradients. Notably, this approach requires no architectural modifications and demonstrates consistent performance gains across varying model sizes and datasets. It significantly enhances long-chain reasoning capabilities in geometric and mathematical reasoning tasks while exhibiting robustness to hyperparameter choices.
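The mechanism the summary describes — score each sample's evolving utility from online rollouts, split the batch into temporarily favorable and unfavorable samples, and reweight a group-normalized advantage accordingly — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the utility scores, the median-based partition, and the `up_weight`/`down_weight` scaling factors are all assumptions for the sake of the example.

```python
import numpy as np

def dynamic_advantages(rewards, utility, up_weight=1.5, down_weight=0.5):
    """Sketch of dynamic advantage reweighting (ADORA-style, assumptions noted).

    rewards: (num_samples, group_size) rollout rewards per training sample
    utility: (num_samples,) evolving utility score per sample, e.g. a recent
             change in mean rollout reward -- an assumed proxy, not the
             paper's definition
    """
    # Group-normalized advantage, a common GRPO-style baseline.
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True) + 1e-8
    adv = (rewards - mean) / std

    # Adaptively partition samples into temporarily advantageous /
    # disadvantageous sets (thresholding at the batch median here).
    favorable = utility >= np.median(utility)

    # Reweight the advantage: emphasize currently informative samples,
    # damp the rest, then feed the result into the policy-gradient loss.
    weights = np.where(favorable, up_weight, down_weight)
    return adv * weights[:, None]
```

Because the reweighting happens purely on the advantage tensor, a scheme like this slots into an existing policy-optimization loop without architectural changes, which matches the integration property the summary highlights.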

📝 Abstract
Reinforcement learning has become a cornerstone technique for developing reasoning models in complex tasks, ranging from mathematical problem-solving to imaginary reasoning. The optimization of these models typically relies on policy gradient methods, whose efficacy hinges on the accurate estimation of an advantage function. However, prevailing methods typically employ static advantage estimation, a practice that leads to inefficient credit assignment by neglecting the dynamic utility of training samples over time. This limitation results in suboptimal policy updates, which in turn manifest as slower convergence rates and increased learning instability, as models fail to adapt to evolving sample utilities effectively. To address this problem, we introduce ADORA (Advantage Dynamics via Online Rollout Adaptation), a novel framework for policy optimization. ADORA dynamically adjusts the advantage function's weighting by adaptively categorizing training data into temporarily advantageous and disadvantageous samples, based on their evolving utility during online model rollouts. This tailored data differentiation strategy allows ADORA to be seamlessly integrated into existing policy optimization algorithms without significant architectural modifications, enabling the policy to prioritize learning from more informative experiences and thereby achieve more efficient policy updates. Extensive evaluations across diverse model families and varying data scales demonstrate that ADORA is a robust and efficient framework. It significantly enhances long reasoning in both geometric and mathematical tasks, consistently achieving notable performance gains without requiring sensitive hyperparameter tuning.
Problem

Research questions and friction points this paper is trying to address.

reinforcement learning
reasoning models
advantage estimation
credit assignment
policy optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic Advantage Estimation
Online Rollout Adaptation
Policy Optimization
Reinforcement Learning for Reasoning
Credit Assignment