🤖 AI Summary
This work addresses a limitation of existing code reasoning approaches: they supervise only the final output and neglect intermediate execution states, which often leads to reward hacking. To mitigate this, the authors propose explicit intermediate-state supervision: structured print statements are automatically injected into code as execution-trace anchors, and the model is trained to predict the runtime state at each anchor. They further introduce Bi-Level GRPO, a reinforcement learning algorithm that performs structured credit assignment both across and within execution traces. Together, these recast code reasoning as a verifiable, step-by-step execution modeling problem. Experiments show that a 7B-parameter model achieves 91.1% on CRUXEval and 86.5% on LiveCodeBench, outperforming CodeReasoner-7B (86.0% and 77.7%) and GPT-4o (85.6% and 75.1%), while also attaining 82.9% on the REval execution-trace benchmark and improving code generation performance.
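As a rough illustration of what print-based execution-trace anchors could look like, the sketch below instruments Python source with the standard `ast` module. It is not the authors' tool: the `[TRACE]` marker, the choice to anchor only on simple name assignments, and the names `AnchorInjector`, `inject_anchors`, and `run_with_trace` are all assumptions made for this example.

```python
import ast
import contextlib
import io

ANCHOR = "[TRACE]"  # hypothetical marker; the paper's anchor format is not given here


class AnchorInjector(ast.NodeTransformer):
    """Insert a print call after each assignment to a plain variable name."""

    def _anchor(self, name, lineno):
        # Builds the statement: print('[TRACE]', <lineno>, '<name>', repr(<name>))
        src = f"print({ANCHOR!r}, {lineno}, {name!r}, repr({name}))"
        return ast.parse(src).body[0]

    def visit_Assign(self, node):
        anchors = [
            self._anchor(t.id, node.lineno)
            for t in node.targets
            if isinstance(t, ast.Name)
        ]
        # Returning a list splices the anchors in right after the assignment.
        return [node, *anchors]

    def visit_AugAssign(self, node):
        if isinstance(node.target, ast.Name):
            return [node, self._anchor(node.target.id, node.lineno)]
        return node


def inject_anchors(source):
    """Return the source with trace anchors inserted (requires Python 3.9+)."""
    tree = AnchorInjector().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


def run_with_trace(source):
    """Execute the instrumented source and collect the anchor lines it prints."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(inject_anchors(source), {})
    return [line for line in buf.getvalue().splitlines() if line.startswith(ANCHOR)]


SAMPLE = """
total = 0
for i in range(3):
    total += i
"""

if __name__ == "__main__":
    for line in run_with_trace(SAMPLE):
        print(line)
```

Running the instrumented program yields the ground-truth runtime states, so a model's predicted value at each anchor could be checked against them line by line. Each anchor records the source line number and variable name, which keeps predictions aligned with specific steps.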
📝 Abstract
Existing code reasoning methods primarily supervise final code outputs and ignore intermediate states, which often leads to reward hacking, where correct answers are obtained through inconsistent reasoning. We propose StepCodeReasoner, a framework that introduces explicit intermediate execution-state supervision. By automatically inserting structured print-based execution-trace anchors into code, the model is trained to predict runtime states at each step, transforming code reasoning into a verifiable, stepwise execution modeling problem. Building on this execution-aware method, we introduce Bi-Level GRPO, a reinforcement learning algorithm for structured credit assignment at two levels: inter-trajectory, comparing alternative execution paths, and intra-trajectory, rewarding intermediate accuracy based on its impact on downstream correctness. Extensive experiments demonstrate that StepCodeReasoner achieves state-of-the-art performance in code reasoning. In particular, our 7B model achieves 91.1% on CRUXEval and 86.5% on LiveCodeBench, outperforming the CodeReasoner-7B baseline (86.0% and 77.7%) and GPT-4o (85.6% and 75.1%). Furthermore, on the execution-trace benchmark REval, our model scores 82.9%, outperforming the CodeReasoner-7B baseline (72.3%), its 14B counterpart (81.1%), and GPT-4o (77.3%). Our approach also improves code generation performance, demonstrating that explicit execution modeling enhances both code reasoning and code generation.
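To make the two levels of credit assignment more concrete, here is a minimal sketch of group-relative advantage normalization (the standard GRPO idea of scoring each sample against its group's mean and spread) applied once across trajectories and once across the steps of a single trajectory. The abstract does not give the reward definitions, so everything specific here is an assumption: the binary per-step correctness signal, the additive combination, the weight `beta`, and the function names. In particular, this sketch does not model the downstream-impact weighting the abstract describes for the intra-trajectory level.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Inter-trajectory level: normalize final rewards within a sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def step_advantages(step_correct, eps=1e-6):
    """Intra-trajectory level (assumed form): normalize per-step correctness
    within one trace so correct steps score above incorrect ones."""
    scores = [1.0 if ok else 0.0 for ok in step_correct]
    mu = mean(scores)
    sigma = pstdev(scores)
    return [(s - mu) / (sigma + eps) for s in scores]


def bi_level_advantages(group, beta=0.5):
    """Combine both levels into one advantage per step of each trajectory.

    `group` is a list of (final_reward, step_correct) pairs, one per sampled
    trajectory. `beta` is a hypothetical mixing weight, not from the paper.
    """
    inter = group_advantages([reward for reward, _ in group])
    return [
        [a + beta * s for s in step_advantages(steps)]
        for a, (_, steps) in zip(inter, group)
    ]


if __name__ == "__main__":
    # Two sampled traces for one problem: the first is fully correct, the
    # second gets the first anchor right but the last anchor and answer wrong.
    group = [(1.0, [True, True]), (0.0, [True, False])]
    for advantages in bi_level_advantages(group):
        print([round(a, 3) for a in advantages])
```

In this toy group, every step of the correct trajectory receives a positive advantage, while in the failed trajectory the incorrect step is penalized more heavily than the correct one that preceded it.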