Chain of Execution Supervision Promotes General Reasoning in Large Language Models

πŸ“… 2025-10-23
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Large language models (LLMs) suffer degraded code-reasoning performance because reasoning in code is expressed implicitly and entangled with syntactic noise, so direct supervised fine-tuning on raw code yields suboptimal results. Method: the paper proposes the Chain of Execution (CoE) paradigm, presented as the first systematic framework that decomposes code execution into explicit, fine-grained stepwise reasoning traces, augmented with variable tracking and code rewriting to improve logical clarity and reasoning diversity. Contribution/Results: the authors construct TracePile, a large-scale dataset of code execution traces spanning mathematics, classical algorithms, and programming competitions. Leveraging code parsing, execution tracing, and natural-language alignment, they apply continued pretraining followed by two-stage fine-tuning across four major foundation models. The approach delivers consistent gains on 20 benchmarks: LLaMA3.1-8B improves by 7.1% on average across nine mathematical datasets and clearly outperforms baselines on LiveCodeBench, CRUX, and MMLU.
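The core idea of the summary above is turning implicit execution into an explicit trace of intermediate states. The paper does not specify its tracing pipeline or trace format here, so the following is a minimal illustrative sketch, assuming a Python-level tracer that snapshots local variables at each executed line (`trace_execution` and the trace tuple format are hypothetical names, not the paper's API):

```python
import sys

def trace_execution(fn, *args):
    """Record (line number, local variables) at each executed line of fn.

    A hypothetical sketch of what a Chain-of-Execution-style trace could
    look like; the paper's actual trace construction is not detailed here.
    """
    steps = []

    def tracer(frame, event, arg):
        # Only record lines inside the traced function itself.
        if event == "line" and frame.f_code is fn.__code__:
            steps.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer

    sys.settrace(tracer)
    try:
        result = fn(*args)
    finally:
        sys.settrace(None)  # always detach the tracer
    return result, steps

def gcd(a, b):
    while b:
        a, b = b, a % b
    return a

result, steps = trace_execution(gcd, 48, 18)
# Each step pairs a source line with the live variable bindings at that
# point, turning implicit execution into an explicit reasoning trace.
```

Traces of this shape can then be rendered as natural-language rationales ("after this iteration, a=18 and b=12, ..."), which is the kind of alignment the summary describes.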

πŸ“ Abstract
Building robust and general reasoning ability is a central goal in the development of large language models (LLMs). Recent efforts increasingly turn to code as a rich training source, given its inherent logical structure and diverse reasoning paradigms such as divide-and-conquer, topological ordering, and enumeration. However, reasoning in code is often expressed implicitly and entangled with syntactic or implementation noise, making direct training on raw code suboptimal. To address this, we introduce TracePile, a large-scale corpus of 2.6 million samples that transforms code execution into explicit, step-by-step chain-of-thought-style rationales, which we call Chain of Execution (CoE). The corpus spans domains including mathematics, classical algorithms, and algorithmic competitions, and is enriched with variable-tracing questions and code rewritings to enhance logical granularity and code diversity. We evaluate TracePile using three training setups: continued pretraining, instruction tuning after pretraining, and two-stage fine-tuning. Experiments across four base models (LLaMA 3, LLaMA 3.1, Qwen-2.5, and Qwen-2.5 Coder) and 20 benchmarks covering math, code, logic, and algorithms demonstrate consistent improvements. Notably, TracePile boosts LLaMA3.1-8B by 7.1% on average across nine math datasets and delivers clear gains on LiveCodeBench, CRUX, and MMLU under two-stage fine-tuning.
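The abstract mentions enriching the corpus with variable-tracing questions. The paper's question format is not shown here, so the following is a hedged sketch of what such a question-answer pair might look like, with the step-by-step rationale derivable from the trace (the `snippet`/`question` structure is an assumption for illustration):

```python
# A hypothetical variable-tracing question in the style the abstract
# describes: given a snippet, ask for a variable's value at some point.
snippet = (
    "x = 3\n"
    "for i in range(2):\n"
    "    x = x * i + 1\n"
)
question = "What is the value of x after the loop finishes?"

# Step-by-step rationale (the CoE-style explicit trace):
#   start: x = 3
#   i = 0: x = 3 * 0 + 1 = 1
#   i = 1: x = 1 * 1 + 1 = 2
namespace = {}
exec(snippet, namespace)   # ground-truth answer comes from real execution
answer = namespace["x"]
```

Grounding the answer in actual execution (rather than model generation) is what makes such supervision reliable.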
Problem

Research questions and friction points this paper is trying to address.

Transforming implicit code reasoning into explicit step-by-step rationales
Building robust, general reasoning ability in large language models
Mitigating the suboptimal training that results from entangled syntactic and implementation noise in raw code
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transforms code execution into explicit step-by-step rationales
Enriches corpus with variable-tracing and code rewriting techniques
Uses three training setups including two-stage fine-tuning
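One of the augmentations listed above is code rewriting, which diversifies surface form while preserving semantics. The paper's rewriting rules are not detailed in this summary, so here is a hypothetical illustration of the idea, rewriting an iterative algorithm into an equivalent recursive form:

```python
def gcd_iterative(a, b):
    # Euclid's algorithm as a loop.
    while b:
        a, b = b, a % b
    return a

def gcd_recursive(a, b):
    # The same algorithm rewritten from a loop into recursion;
    # the sequence of (a, b) states, and hence the execution trace,
    # is identical to the iterative version.
    return a if b == 0 else gcd_recursive(b, a % b)

# Semantic equivalence across the rewrite can be checked by execution.
equivalent = all(
    gcd_iterative(a, b) == gcd_recursive(a, b)
    for a in range(1, 30) for b in range(1, 30)
)
```

Pairing such rewrites with the same execution trace gives the model multiple syntactic views of one underlying reasoning chain.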
πŸ”Ž Similar Papers
No similar papers found.