Pinpointing crucial steps: Attribution-based Credit Assignment for Verifiable Reinforcement Learning

📅 2025-10-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
In Reinforcement Learning with Verifiable Rewards (RLVR), an imbalance between exploration and exploitation leads to inaccurate credit assignment and entropy collapse. To address this, we propose ACPO, a phased framework. Its core contributions are: (1) a dynamic policy-entropy regulation mechanism grounded in semantic trajectory segmentation and attribution-aware representation, enabling difficulty-aware adaptive exploration; (2) factorized reward modeling coupled with hierarchical credit assignment, precisely attributing reward signals to critical reasoning steps; and (3) a synergistic optimization paradigm integrating attribution-based contribution evaluation with difficulty-aware curriculum learning. Evaluated on high-difficulty mathematical reasoning benchmarks, including AIME, MATH, and AMC, ACPO consistently outperforms existing state-of-the-art methods, demonstrating superior effectiveness, robustness, and generalization on complex multi-step reasoning tasks.

📝 Abstract
While Reinforcement Learning with Verifiable Rewards (RLVR) enhances complex reasoning in LLMs, current methods struggle to balance exploration and exploitation. This leads to critical issues like inaccurate credit assignment for intermediate steps and premature entropy collapse, limiting model performance. To address this, we introduce Attribution-based Contribution to Policy Optimization (ACPO), a phased framework that incorporates a difficulty-aware curriculum. ACPO improves exploration by using trajectory semantic segmentation and an attribution-based representation to dynamically regulate policy entropy, thus mitigating its collapse. Concurrently, it enhances exploitation with a factorized reward system that precisely quantifies the hierarchical contribution of each reasoning step, ensuring accurate credit assignment. Extensive experiments on challenging benchmarks, including AIME, MATH, and AMC, demonstrate that ACPO significantly outperforms existing state-of-the-art approaches.
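The factorized reward and hierarchical credit-assignment idea in the abstract can be illustrated with a minimal sketch: a verifiable trajectory-level reward is redistributed over reasoning segments in proportion to per-segment attribution scores, so that critical steps receive more credit. All names here are hypothetical, and the paper does not publish this exact factorization; this is only the general shape of attribution-weighted credit assignment.

```python
def assign_step_credit(segment_scores, final_reward):
    """Distribute a verifiable trajectory-level reward across reasoning
    segments in proportion to their attribution scores.

    Hypothetical sketch of attribution-based credit assignment, not the
    paper's exact reward factorization.
    """
    total = sum(segment_scores)
    if total == 0:
        # Uniform fallback when attribution is uninformative.
        return [final_reward / len(segment_scores)] * len(segment_scores)
    return [final_reward * s / total for s in segment_scores]


# Example: a trajectory with two segments, where the second segment's
# attribution score is three times the first, shares a reward of 1.0.
print(assign_step_credit([1.0, 3.0], 1.0))
```

In an RLVR training loop, the resulting per-segment credits would replace the single terminal reward when computing step-level advantages, which is what allows the policy update to distinguish crucial reasoning steps from filler.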
Problem

Research questions and friction points this paper is trying to address.

Balancing exploration and exploitation in verifiable reinforcement learning
Addressing inaccurate credit assignment for intermediate reasoning steps
Preventing premature entropy collapse during policy optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Attribution-based phased framework for policy optimization
Trajectory segmentation dynamically regulates policy entropy
Factorized reward system quantifies hierarchical step contributions
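The entropy-regulation innovation above can be sketched as a difficulty-aware entropy coefficient: problems with a lower observed solve rate get a larger exploration bonus, counteracting premature entropy collapse on hard instances. The linear schedule and coefficient values below are illustrative assumptions, not the paper's published mechanism.

```python
def entropy_coefficient(solve_rate, base_coef=0.01, max_coef=0.05):
    """Scale the entropy-bonus coefficient by problem difficulty.

    solve_rate: empirical fraction of sampled trajectories that pass the
    verifier for this problem (1.0 = easy, 0.0 = hard).
    Harder problems receive a coefficient closer to max_coef, encouraging
    more exploration. Hypothetical sketch of difficulty-aware regulation.
    """
    difficulty = 1.0 - solve_rate
    return base_coef + (max_coef - base_coef) * difficulty


# Easy problem: minimal exploration bonus; hard problem: maximal bonus.
print(entropy_coefficient(1.0), entropy_coefficient(0.0))
```

The coefficient would multiply the policy-entropy term added to the RL objective, so the degree of exploration adapts per problem rather than being fixed globally.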
Haisen Luo
Institute of Artificial Intelligence, Taikang Insurance Group Inc
Zhenyu Li
Institute of Artificial Intelligence, Taikang Insurance Group Inc
Yihua Liu
Institute of Artificial Intelligence, Taikang Insurance Group Inc
Junxi Yin
Institute of Artificial Intelligence, Taikang Insurance Group Inc
Dan Liu
Institute of Artificial Intelligence, Taikang Insurance Group Inc
Zequn Li
Institute of Artificial Intelligence, Taikang Insurance Group Inc
Xiaohang Xu
Postdoc at the University of Tokyo
Spatial-temporal data mining · Recommendation systems · Federated learning