🤖 AI Summary
Large language models (LLMs) suffer degraded code-reasoning performance because reasoning in code is often expressed implicitly and entangled with syntactic noise, and direct supervised fine-tuning on raw code yields suboptimal results. Method: This paper proposes the Chain of Execution (CoE) paradigm, the first systematic framework that decomposes code execution into explicit, fine-grained stepwise reasoning traces, augmented with variable tracking and code rewriting to enhance logical clarity and reasoning diversity. Contribution/Results: We construct TracePile, a large-scale code execution trace dataset spanning mathematics, classical algorithms, and programming competitions. Leveraging code parsing, execution tracing, and natural language alignment, we apply continued pretraining followed by two-stage fine-tuning across four major foundation models. Our approach achieves strong performance on 20 benchmarks: LLaMA3.1-8B improves by 7.1% on average across nine mathematical datasets and significantly outperforms baselines on LiveCodeBench, CRUX, and MMLU.
📄 Abstract
Building robust and general reasoning ability is a central goal in the development of large language models (LLMs). Recent efforts increasingly turn to code as a rich training source, given its inherent logical structure and diverse reasoning paradigms such as divide-and-conquer, topological ordering, and enumeration. However, reasoning in code is often expressed implicitly and entangled with syntactic or implementation noise, making direct training on raw code suboptimal. To address this, we introduce TracePile, a large-scale corpus of 2.6 million samples that transforms code execution into explicit, step-by-step chain-of-thought-style rationales, which we call Chain of Execution (CoE). The corpus spans domains including mathematics, classical algorithms, and algorithmic competitions, and is enriched with variable-tracing questions and code rewritings to enhance logical granularity and code diversity. We evaluate TracePile using three training setups: continued pretraining, instruction tuning after pretraining, and two-stage fine-tuning. Experiments across four base models (LLaMA 3, LLaMA 3.1, Qwen-2.5, and Qwen-2.5 Coder) and 20 benchmarks covering math, code, logic, and algorithms demonstrate consistent improvements. Notably, TracePile boosts LLaMA3.1-8B by 7.1% on average across nine math datasets and delivers clear gains on LiveCodeBench, CRUX, and MMLU under two-stage fine-tuning.
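To make the idea of an execution trace concrete, the sketch below records a Chain-of-Execution-style trace for a small function: one entry per executed line, paired with the live variable bindings at that step. This is an illustrative approximation using Python's `sys.settrace`, not the paper's actual pipeline; the function `trace_execution` and the trace format are hypothetical.

```python
import sys

def trace_execution(func, *args):
    """Run func(*args) and record a stepwise execution trace.

    Each step pairs the executed line number with a snapshot of the
    local variables at that point, loosely mirroring the variable
    tracking described for CoE (a hypothetical sketch, not the
    paper's implementation).
    """
    steps = []

    def tracer(frame, event, arg):
        # Capture only line events inside the traced function.
        if event == "line" and frame.f_code is func.__code__:
            steps.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer

    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(None)
    return result, steps

def gcd(a, b):
    while b:
        a, b = b, a % b
    return a

result, steps = trace_execution(gcd, 12, 8)
# The trace walks the variable state step by step:
# a=12,b=8 -> a=8,b=4 -> a=4,b=0, then return a, i.e. 4.
```

Rendering such (line, variables) snapshots into natural-language rationales is one plausible way to turn implicit code reasoning into the explicit, fine-grained traces the corpus is built from.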