ExecVerify: White-Box RL with Verifiable Stepwise Rewards for Code Execution Reasoning

📅 2026-03-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited code execution reasoning capabilities of small-scale code large language models and the inadequacy of existing supervised fine-tuning methods, which struggle to verify intermediate reasoning steps and lack control over task difficulty. To overcome these limitations, the authors propose a verifiable white-box reward mechanism based on execution traces—such as next-statement prediction and variable value or type prediction—and, for the first time, integrate it into reinforcement learning to jointly optimize both intermediate reasoning and final outputs. They further construct a synthetic program dataset with multiple difficulty levels and design a two-stage training pipeline: first enhancing execution reasoning ability and then transferring this capability to code generation. Experiments demonstrate that a 7B-parameter model achieves performance on code reasoning benchmarks comparable to that of a 32B-parameter model, with up to a 5.9% absolute improvement in pass@1 on code generation tasks.

📝 Abstract
Code LLMs still struggle with code execution reasoning, especially in smaller models. Existing methods rely on supervised fine-tuning (SFT) with teacher-generated explanations, primarily in two forms: (1) input-output (I/O) prediction chains and (2) natural-language descriptions of execution traces. However, intermediate execution steps cannot be explicitly verified during SFT, so the training objective can reduce to merely matching teacher explanations. Moreover, training data is typically collected without explicit control over task difficulty. We introduce ExecVerify, which goes beyond text imitation by incorporating verifiable white-box rewards derived from execution traces, including next-statement prediction and variable value/type prediction. Our work first builds a dataset with multiple difficulty levels via constraint-based program synthesis. Then, we apply reinforcement learning (RL) to reward correct answers about both intermediate execution steps and final outputs, aligning the training objective with semantic correctness at each execution step. Finally, we adopt a two-stage training pipeline that first enhances execution reasoning and then transfers to code generation. Experiments demonstrate that a 7B model trained with ExecVerify achieves performance comparable to 32B models on code reasoning benchmarks and improves pass@1 by up to 5.9% on code generation tasks over strong post-training baselines.
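The abstract's "verifiable white-box rewards derived from execution traces" can be made concrete with a small sketch. The code below is not the authors' implementation; it is a minimal illustration of the general idea, assuming a Python setting: a line-level execution trace is collected with `sys.settrace`, and a model's stepwise variable-value predictions are scored against the ground-truth trace. The names `collect_trace` and `stepwise_reward`, and the reward definition (fraction of correct predictions), are hypothetical choices for illustration.

```python
import sys

def collect_trace(fn, *args):
    """Run fn(*args), recording (line_number, locals snapshot) at each
    'line' event, i.e. just before each line of fn executes."""
    trace = []

    def tracer(frame, event, arg):
        # Only record events from fn's own code object.
        if event == "line" and frame.f_code is fn.__code__:
            trace.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer

    sys.settrace(tracer)
    try:
        fn(*args)
    finally:
        sys.settrace(None)
    return trace

def stepwise_reward(predictions, trace):
    """Verifiable stepwise reward: the fraction of predicted
    (step_index, variable_name, value) triples that match the
    ground-truth trace. No teacher text is involved."""
    if not predictions:
        return 0.0
    hits = 0
    for step, var, value in predictions:
        if 0 <= step < len(trace) and trace[step][1].get(var) == value:
            hits += 1
    return hits / len(predictions)

def demo(n):
    # Toy program whose execution the "model" must reason about.
    total = 0
    for i in range(n):
        total += i
    return total

trace = collect_trace(demo, 3)
# A correct prediction about the final step earns full reward;
# mixing in a wrong variable-value guess lowers it proportionally.
reward = stepwise_reward([(len(trace) - 1, "total", 3)], trace)
```

A real system would also verify next-statement predictions (e.g. against the recorded line numbers) and feed the scalar reward into an RL objective; this sketch only shows why trace-based rewards are checkable without a teacher model.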
Problem

Research questions and friction points this paper is trying to address.

code execution reasoning
verifiable rewards
intermediate execution steps
task difficulty
supervised fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

white-box reinforcement learning
verifiable rewards
code execution reasoning
constraint-based program synthesis
two-stage training