Conditional Advantage Estimation for Reinforcement Learning in Large Reasoning Models

📅 2025-09-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing RLVR methods for verifiable tasks (e.g., mathematical reasoning) rely on hand-crafted directional priors (e.g., “higher is better”), making them sensitive to hyperparameters and prone to performance degradation. This work proposes CANON, a conditional advantage estimation framework that eliminates the need for predefined metric directionality. CANON dynamically identifies beneficial reasoning patterns by grouping responses—e.g., by entropy or length—and performing inter-group comparisons to amplify advantage signals under verifiable rewards. Its core innovation lies in decoupling metric directionality from advantage estimation, thereby mitigating human bias. Experiments demonstrate that entropy-based CANON consistently outperforms state-of-the-art methods across three LLMs. Moreover, length-based grouping significantly improves token efficiency, achieving a superior Pareto frontier between performance and computational cost.

📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) for large language models (LLMs) has achieved remarkable progress in enhancing reasoning capabilities on tasks with clear correctness criteria, such as mathematical reasoning. Several training metrics, such as entropy or response length, have been observed to correlate with different reasoning behaviors in reinforcement learning. Prior approaches incorporate such priors through reward or advantage shaping, which often relies on hand-crafted penalties and preferences (e.g., higher-is-better or lower-is-better). However, without careful hyperparameter tuning, these directional priors can be overly biased and may lead to training failure. To this end, we introduce Conditional advANtage estimatiON (CANON), which amplifies the impact of a target metric without presuming its direction. Specifically, CANON regroups the sampled responses into two groups based on the higher or lower value of the target metric, measures which metric trend contributes to better performance through inter-group comparison, and identifies the better responses within each group. Entropy-based CANON consistently outperforms prior methods across three LLMs on both math reasoning and high-complexity logic tasks. When applied to response length, CANON further improves token efficiency, yielding a more favorable Pareto frontier in the performance-cost trade-off.
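The grouping-and-comparison scheme in the abstract can be sketched in a few lines. The code below is a minimal illustration, not the paper's exact formulation: the function name, the even median split, and the simple sum of an inter-group term (which metric trend helps) and an intra-group term (which response is better within its group) are all assumptions made for clarity.

```python
import statistics

def canon_advantages(rewards, metric_values):
    """Sketch of CANON-style conditional advantage estimation.

    Responses are split into a low-metric and a high-metric group
    (the metric might be entropy or response length). Each response's
    advantage combines:
      - an inter-group term: its group's mean reward minus the other
        group's, so the signal points toward the better metric trend
        without a hand-crafted higher/lower-is-better prior;
      - an intra-group term: its reward centered on its own group mean,
        identifying better responses within the same group.
    """
    n = len(rewards)
    # Sort response indices by the target metric and split in half.
    order = sorted(range(n), key=lambda i: metric_values[i])
    low, high = order[: n // 2], order[n // 2:]

    mean_low = statistics.mean(rewards[i] for i in low)
    mean_high = statistics.mean(rewards[i] for i in high)

    advantages = [0.0] * n
    for group, mean_own, mean_other in ((low, mean_low, mean_high),
                                        (high, mean_high, mean_low)):
        inter = mean_own - mean_other          # inter-group comparison
        for i in group:
            intra = rewards[i] - mean_own      # intra-group comparison
            advantages[i] = inter + intra
    return advantages
```

For example, if the high-metric group earns uniformly higher rewards, its responses receive positive advantages and the low-metric group negative ones, so training shifts toward that metric trend; the direction is discovered from the sampled batch rather than assumed in advance.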
Problem

Research questions and friction points this paper is trying to address.

Estimating advantages without presuming metric directionality in reinforcement learning
Improving reasoning performance while maintaining token efficiency in LLMs
Overcoming biased priors in reward shaping for mathematical and logic tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Conditional Advantage Estimation amplifies metric impact without direction bias
CANON regroups responses by metric value for inter-group comparison
Method improves performance and token efficiency across reasoning tasks