🤖 AI Summary
Existing RLVR methods for verifiable tasks (e.g., mathematical reasoning) rely on hand-crafted directional priors (e.g., "higher is better"), making them sensitive to hyperparameters and prone to performance degradation. This work proposes CANON, a conditional advantage estimation framework that eliminates the need for predefined metric directionality. CANON dynamically identifies beneficial reasoning patterns by grouping responses, e.g., by entropy or length, and performing inter-group comparisons to amplify advantage signals under verifiable rewards. Its core innovation lies in decoupling metric directionality from advantage estimation, thereby mitigating human bias. Experiments demonstrate that entropy-based CANON consistently outperforms state-of-the-art methods across three major LLMs. Moreover, length-based grouping significantly improves token efficiency, achieving a superior Pareto frontier between performance and computational cost.
📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) for large language models (LLMs) has achieved remarkable progress in enhancing LLMs' reasoning capabilities on tasks with clear correctness criteria, such as mathematical reasoning tasks. Several training metrics, such as entropy or response length, have been observed to correlate with different reasoning behaviors in reinforcement learning. Prior approaches incorporate such priors through reward or advantage shaping, which often rely on hand-crafted penalties and directional preferences (e.g., higher-is-better or lower-is-better). However, without careful hyperparameter tuning, these directional priors can be overly biased and may lead to training failure. To this end, we introduce Conditional advANtage estimatiON (CANON), which amplifies the impact of a target metric without presuming its direction. Specifically, CANON regroups the sampled responses into two groups based on the higher or lower value of a target metric, measures which metric trend contributes to better performance through inter-group comparison, and identifies the better responses within the same group. Entropy-based CANON consistently outperforms prior methods across three LLMs on both math reasoning and high-complexity logic tasks. When applied to response length, CANON further improves token efficiency, yielding a more favorable Pareto frontier in the performance-cost trade-off.
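The grouping-and-comparison idea described above can be illustrated with a minimal sketch. This is not the paper's actual estimator; the function name, the median split, and the simple sum of an inter-group term and an intra-group term are all assumptions made for illustration, consistent with the abstract's description (split by metric value, compare across groups to find the beneficial trend, compare within groups to find better responses):

```python
import statistics

def canon_advantages(rewards, metric_values):
    """Illustrative conditional advantage estimation (hypothetical sketch).

    Splits sampled responses into low- and high-metric halves by a median
    split, then combines two signals per response:
      * inter-group term: how much that response's group outperforms the
        overall average (which metric trend is beneficial), and
      * intra-group term: how much the response beats its own group's mean
        (which responses are better within the same group).
    """
    n = len(rewards)
    order = sorted(range(n), key=lambda i: metric_values[i])
    low, high = order[: n // 2], order[n // 2:]

    mean_low = statistics.mean(rewards[i] for i in low)
    mean_high = statistics.mean(rewards[i] for i in high)
    overall = (mean_low + mean_high) / 2

    advantages = [0.0] * n
    for group, group_mean in ((low, mean_low), (high, mean_high)):
        for i in group:
            inter = group_mean - overall       # favored metric trend
            intra = rewards[i] - group_mean    # within-group comparison
            advantages[i] = inter + intra
    return advantages
```

For example, if high-entropy responses happen to earn higher verifiable rewards, the inter-group term is positive for the high-entropy half and negative for the low-entropy half, steering the policy toward the beneficial trend without any hand-coded "higher is better" prior; if the relationship flips, the signal flips with it.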