CodeReasoner: Enhancing the Code Reasoning Ability with Reinforcement Learning

📅 2025-07-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current large language models (LLMs) exhibit limited capability in reasoning about code execution behavior—such as output prediction and statement reachability analysis—and supervised fine-tuning alone generalizes poorly. To address this, the paper proposes CodeReasoner, which pairs dataset construction with a two-stage training process: (1) a trajectory-guided data construction phase that synthesizes high-quality reasoning-chain datasets focused on the core execution logic of Python programs, followed by instruction tuning that injects execution-specific knowledge distilled from a powerful teacher model; and (2) a GRPO-based reinforcement learning phase that strengthens reasoning and generalization on top of the fine-tuned model. A 7B-parameter model trained this way improves over prior methods by 27.1%–40.2% across three standard benchmarks and matches GPT-4o on key tasks such as input/output and coverage prediction; the 14B variant surpasses GPT-4o on all benchmarks. The core contributions are an execution-logic-aware data construction paradigm and a GRPO-driven reasoning alignment stage, which together enhance the generalization and scalability of compact LLMs for code reasoning.

📝 Abstract
Code reasoning is a fundamental capability for large language models (LLMs) in the code domain. It involves understanding and predicting a program's execution behavior, such as determining the output for a given input or whether a specific statement will be executed. This capability is essential for downstream tasks like debugging, code generation, and program repair. Prior approaches mainly rely on supervised fine-tuning to improve performance in code reasoning tasks. However, they often show limited gains and fail to generalize across diverse scenarios. We argue this is due to two core issues: the low quality of training data and the limitations of supervised fine-tuning, which struggles to teach general reasoning skills. To address these challenges, we propose CodeReasoner, a framework that spans both dataset construction and a two-stage training process. First, we introduce a method to construct datasets that focus on the core execution logic of Python programs. Next, we apply instruction tuning to inject execution-specific knowledge distilled from a powerful teacher model. We then enhance reasoning and generalization through GRPO reinforcement learning on top of the fine-tuned model. Experiments on three widely-used code reasoning benchmarks show that CodeReasoner improves performance by 27.1% to 40.2% over prior methods using a 7B model. Notably, the 7B model matches GPT-4o on key tasks like input/output and coverage prediction. When scaled to 14B, CodeReasoner outperforms GPT-4o across all benchmarks. Ablation studies confirm the effectiveness of each training stage and highlight the importance of reasoning chains.
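The abstract's first stage builds training data from a program's execution behavior. As a minimal sketch of what trajectory collection could look like—the paper's actual pipeline, prompts, and trace format are not specified here, and `collect_trace`/`sample` are hypothetical names—Python's `sys.settrace` hook can record which statements a program executes for a given input:

```python
import sys

def collect_trace(func, *args):
    """Run func(*args) and record the line numbers it executes.

    Hypothetical helper: one way to obtain execution trajectories
    (output + executed statements) for dataset construction.
    """
    executed = []

    def tracer(frame, event, arg):
        # Only record 'line' events inside the target function's code.
        if event == "line" and frame.f_code is func.__code__:
            executed.append(frame.f_lineno)
        return tracer

    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(None)  # always detach the tracer
    return result, executed

def sample(x):
    if x > 0:
        return x * 2
    return -x

out, lines = collect_trace(sample, 3)
print(out, lines)  # output 6; two executed lines (the branch test and the taken return)
```

Pairs of (input, output, executed statements) gathered this way correspond directly to the benchmark tasks the abstract mentions: output prediction and coverage/reachability prediction.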
Problem

Research questions and friction points this paper is trying to address.

Improving code reasoning in LLMs via reinforcement learning
Addressing low-quality data and supervised fine-tuning limitations
Enhancing generalization across diverse code execution scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Constructs datasets focusing on Python execution logic
Uses instruction tuning for execution-specific knowledge
Applies GRPO reinforcement learning for reasoning enhancement
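The GRPO stage listed above works without a learned value model: for each prompt it samples a group of candidate reasoning chains, scores them (e.g. did the predicted output match the real execution?), and normalizes each reward against the group's mean and standard deviation to get per-sample advantages. A minimal sketch of that group-relative normalization (the reward design and policy update themselves are not reproduced here):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages as in GRPO: normalize each sampled
    completion's reward by the group mean and standard deviation."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

# e.g. binary rewards for 4 sampled reasoning chains on one program:
# chains 1 and 3 predicted the correct output, chains 2 and 4 did not.
advs = grpo_advantages([1.0, 0.0, 1.0, 0.0])
print(advs)  # [1.0, -1.0, 1.0, -1.0]
```

Correct chains receive positive advantage and incorrect ones negative, so the policy gradient pushes probability mass toward reasoning chains that reach the right execution outcome.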