CAPO: Towards Enhancing LLM Reasoning through Verifiable Generative Credit Assignment

📅 2025-08-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing RL-based verifiable reasoning (RLVR) methods treat an LLM’s entire response as a single action and assign homogeneous token-level rewards, resulting in coarse-grained credit assignment and suboptimal optimization of reasoning paths. To address this, we propose Credit Assignment Policy Optimization (CAPO), the first generative fine-grained credit assignment framework. CAPO introduces an LLM-as-Generative Process Reward Model (LLM-as-GenPRM) that produces multi-step, verifiable critiques in a single forward pass, with robustness enhanced via majority voting. It integrates rule-guided binary feedback, token-level reward allocation, and an offline RL optimization paradigm. Extensive experiments across six mathematical reasoning and three cross-domain benchmarks—using backbone models including Llama and Qwen of varying scales—demonstrate that CAPO significantly outperforms both supervised fine-tuning and state-of-the-art RLVR methods, yielding consistent improvements in reasoning accuracy and policy quality.

📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has improved the reasoning abilities of Large Language Models (LLMs) by using rule-based binary feedback, helping to mitigate reward hacking. However, current RLVR methods typically treat whole responses as single actions, assigning the same reward to every token. This coarse-grained feedback hampers precise credit assignment, making it hard for models to identify which reasoning steps lead to success or failure, and often results in suboptimal policies and inefficient learning. Methods like PPO provide credit assignment through value estimation, but often yield inaccurate and unverifiable signals due to limited sampling. On the other hand, methods using Process Reward Models can provide step-by-step judgments for each reasoning step, but they require high-quality process supervision labels and are time-consuming when applied in online reinforcement learning (RL). To overcome these limitations, we introduce a simple yet efficient method, Credit Assignment Policy Optimization (CAPO). Given a reasoning response rollout from the policy model, CAPO directly leverages an off-the-shelf, general-purpose LLM as a Generative Process Reward Model (LLM-as-GenPRM) to generate all step-wise critiques in a single pass, thereby providing verifiable token-level rewards to refine the tokens that were originally assigned identical rule-based rewards. This enables more fine-grained credit assignment in an effective way. Furthermore, to enhance the accuracy and robustness of CAPO, we employ voting mechanisms that scale with the number of generated critiques. Extensive experiments using different backbones such as Llama and Qwen models at different sizes show that CAPO consistently outperforms supervised learning-based and RL-based fine-tuning methods across six challenging mathematical benchmarks and three out-of-domain benchmarks.
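The credit-assignment step the abstract describes — refining a homogeneous rule-based reward with step-wise GenPRM verdicts — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the step-to-token-span mapping, the binary verdict format, and the penalty applied to tokens of faulty steps are all assumptions.

```python
# Hedged sketch of CAPO-style token-level credit assignment.
# Assumptions: each reasoning step maps to a contiguous token span,
# the GenPRM returns one binary verdict per step, and tokens in steps
# judged incorrect have their reward flipped to a penalty.

def refine_token_rewards(step_spans, outcome_reward, step_verdicts):
    """step_spans: list of (start, end) token indices, one per reasoning step.
    outcome_reward: the rule-based reward originally given to every token.
    step_verdicts: booleans from the GenPRM critique, one per step.
    Returns a per-token reward list."""
    n_tokens = step_spans[-1][1]
    rewards = [outcome_reward] * n_tokens
    for (start, end), correct in zip(step_spans, step_verdicts):
        if not correct:
            for t in range(start, end):
                rewards[t] = -abs(outcome_reward)  # penalize tokens of the faulty step
    return rewards

spans = [(0, 5), (5, 12), (12, 20)]   # three reasoning steps over 20 tokens
verdicts = [True, False, True]        # GenPRM judges the second step wrong
refined = refine_token_rewards(spans, 1.0, verdicts)
print(refined[4], refined[5], refined[12])  # step boundaries get different rewards
```

Without the GenPRM verdicts, all 20 tokens would receive the identical outcome reward; the refinement localizes the negative signal to the step the critique flags.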
Problem

Research questions and friction points this paper is trying to address.

Improving LLM reasoning via verifiable credit assignment
Addressing coarse-grained feedback in RLVR methods
Enhancing accuracy with generative process reward models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses LLM-as-GenPRM for step-wise critique
Provides verifiable token-level rewards
Employs voting mechanisms for robustness
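The voting mechanism in the last bullet can be sketched as simple majority voting over repeated critique samples. The `genprm_sample` callable below is a hypothetical stand-in for an actual GenPRM query; the simulated 80%-accurate judge is only for demonstration.

```python
from collections import Counter
import random

def majority_vote(verdicts):
    """Return the most common element of a list of binary verdicts."""
    return Counter(verdicts).most_common(1)[0][0]

def vote_step_verdict(genprm_sample, step, k=5):
    """Query a (hypothetical) GenPRM sampler k times and keep the majority verdict."""
    return majority_vote([genprm_sample(step) for _ in range(k)])

# Simulated noisy judge: correct (True) with probability 0.8 per sample.
random.seed(0)
noisy_judge = lambda step: random.random() < 0.8
print(vote_step_verdict(noisy_judge, step="2 + 2 = 4", k=5))
```

With independent samples at 80% per-critique accuracy, a majority of five is right roughly 94% of the time, which is the robustness gain the voting mechanism targets.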
Guofu Xie
Renmin University of China
Large Language Model · Reinforcement Learning
Yunsheng Shi
Wechat Search, Tencent Inc
Hongtao Tian
Wechat Search, Tencent Inc
Ting Yao
Wechat Search, Tencent Inc
Xiao Zhang
Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China