🤖 AI Summary
This work challenges the prevailing assumption that reinforcement learning (RL) is indispensable for surpassing supervised fine-tuning (SFT) in programming reasoning tasks (e.g., CodeContests, LiveCodeBench), focusing instead on constructing high-quality SFT datasets to enhance small- and medium-scale models.
Method: We propose an instruction-decoupled data filtering paradigm, revealing for the first time the detrimental impact of execution-based filtering on code reasoning distillation. We establish a new principle—"instruction diversity outweighs solution correctness"—and further refine data quality via token-efficiency analysis and reasoning-path validation.
Contribution/Results: We open-source both a high-quality SFT dataset and the corresponding models. Empirical results show that SFT-only models achieve 61.8% on LiveCodeBench and 24.6% on CodeContests, substantially outperforming same-scale RL baselines. This demonstrates that carefully curated SFT data is pivotal for advancing programming reasoning capabilities without RL.
📝 Abstract
Since the advent of reasoning-based large language models, many have found great success in distilling reasoning capabilities into student models. Such techniques have significantly narrowed the gap between reasoning models and standard LLMs on coding tasks. Despite this, much of the progress on distilling reasoning models remains locked behind proprietary datasets or lacks details on data curation, filtering, and subsequent training. To address this, we construct a superior supervised fine-tuning (SFT) dataset that we use to achieve state-of-the-art coding capabilities in models of various sizes. Our distilled models use only SFT to achieve 61.8% on LiveCodeBench and 24.6% on CodeContests, surpassing alternatives trained with reinforcement learning. We then analyze the data sources used to construct our dataset, the impact of code-execution filtering, and the importance of instruction/solution diversity. We observe that execution filtering negatively affects benchmark accuracy, leading us to prioritize instruction diversity over solution correctness. Finally, we analyze the token efficiency and reasoning patterns exhibited by these models. We will open-source these datasets and distilled models to the community.