🤖 AI Summary
Standard reinforcement learning for multi-step reasoning with language models assigns a single scalar outcome reward to an entire trajectory, so it cannot attribute credit to individual reasoning steps or target the steps that caused a failure. The authors propose Self-Reset Policy Optimization (SRPO), in which the model itself localizes the erroneous step in an incorrect trajectory, resets to that intermediate state, and resamples counterfactual continuations so that outcome differences can be attributed to the decisions made there. SRPO requires no external supervision. They also propose Random-Reset Policy Optimization (RRPO), which draws reset states at random from the reasoning steps. Both methods are analyzed within the Conservative Policy Iteration (CPI) framework, where extending CPI with a credit-assignment oracle that targets improvable states yields provable improvements over random resets. Across models and reasoning benchmarks, SRPO consistently outperforms standard GRPO and RRPO.
📝 Abstract
Contemporary reinforcement learning with verifiable rewards (RLVR) methods post-train language models on multi-step reasoning by assigning a single outcome reward uniformly across all tokens in a trajectory. Such uniform assignment ignores which steps contributed to success or failure. Improving credit assignment can address this limitation by enabling targeted refinement of faulty reasoning steps, rather than updating entire trajectories uniformly. Resets are one simple mechanism for this: returning to an intermediate state and resampling counterfactual continuations lets outcome differences be attributed to decisions made at that point. We propose two such methods: Random-Reset Policy Optimization (RRPO), where reset states are drawn randomly from reasoning steps, and Self-Reset Policy Optimization (SRPO), where the model self-localizes the erroneous step in an incorrect trajectory and resets there. We analyze these methods within the Conservative Policy Iteration (CPI) framework. Extending CPI with a credit-assignment oracle that targets improvable states yields provable improvements over random resets. Across models and reasoning benchmarks, SRPO consistently outperforms standard GRPO and RRPO by sampling multiple suffix continuations at a self-localized reset and learning from their rewards, using only the model itself with no external supervision.
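The reset mechanism described above can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not give the training objective or the localization prompt, so everything here is an assumption made for illustration. A reasoning step is reduced to the string "ok" or "bad", the verifier `outcome_reward` accepts a trajectory only if it contains no "bad" step, `toy_sample_suffix` stands in for the policy, `first_bad_step` stands in for the model's self-localization, and suffix rewards are normalized with a GRPO-style group mean and standard deviation.

```python
import random
import statistics
from typing import Callable, List, Sequence, Tuple

Step = str  # toy stand-in for one reasoning step: "ok" or "bad"
HORIZON = 4  # hypothetical fixed number of steps per trajectory


def outcome_reward(trajectory: Sequence[Step]) -> float:
    """Toy verifier: 1.0 only if every step in the trajectory is "ok"."""
    return 1.0 if all(step == "ok" for step in trajectory) else 0.0


def toy_sample_suffix(prefix: Sequence[Step], rng: random.Random) -> List[Step]:
    """Toy policy: fill the remaining steps, each "ok" with probability 0.7."""
    remaining = HORIZON - len(prefix)
    return ["ok" if rng.random() < 0.7 else "bad" for _ in range(remaining)]


def first_bad_step(trajectory: Sequence[Step]) -> int:
    """Stand-in for self-localization (SRPO-style): index of the first bad step."""
    return list(trajectory).index("bad")


def random_reset_index(trajectory: Sequence[Step], rng: random.Random) -> int:
    """Random reset (RRPO-style): pick a step index uniformly at random."""
    return rng.randrange(len(trajectory))


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style normalization (an assumption here): (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def reset_and_resample(
    trajectory: Sequence[Step],
    reset_index: int,
    sample_suffix: Callable[[Sequence[Step], random.Random], List[Step]],
    reward_fn: Callable[[Sequence[Step]], float],
    num_suffixes: int,
    rng: random.Random,
) -> List[Tuple[List[Step], float]]:
    """Keep the prefix before reset_index, resample suffixes, score each one.

    The shared prefix is identical across samples, so differences in reward
    are attributable to decisions made from the reset state onward.
    """
    prefix = list(trajectory[:reset_index])
    suffixes = [sample_suffix(prefix, rng) for _ in range(num_suffixes)]
    rewards = [reward_fn(prefix + suffix) for suffix in suffixes]
    advantages = group_relative_advantages(rewards)
    return list(zip(suffixes, advantages))


if __name__ == "__main__":
    rng = random.Random(0)
    failed = ["ok", "bad", "ok", "ok"]  # incorrect trajectory, error at step 1
    reset_at = first_bad_step(failed)
    for suffix, advantage in reset_and_resample(
        failed, reset_at, toy_sample_suffix, outcome_reward, 8, rng
    ):
        print(suffix, round(advantage, 3))
```

In this sketch, swapping `first_bad_step` for `random_reset_index` gives the random-reset variant. A random reset may land after the faulty step, in which case every resampled suffix inherits the error, all rewards are equal, and the normalized advantages are all zero, which gives an informal picture of why targeting improvable states helps.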