🤖 AI Summary
Existing large language models (LLMs) for code rely heavily on proprietary models to generate large-scale instruction data, incurring high computational and financial costs. To address this, the paper proposes an iterative self-distillation framework built on small-scale open-source LLMs (e.g., 7B-parameter models), which synthesize high-quality code instruction data via multi-checkpoint sampling, multi-aspect automatic scoring, and gradient-based influence estimation. This approach reduces dependence on closed-source models and substantially lowers data construction costs. The resulting SCoder models, fine-tuned from DeepSeek-Coder on the synthesized data, achieve state-of-the-art code generation performance, showing that compact open models can serve as effective instruction-data synthesizers. The core contributions are threefold: (i) an iterative self-distillation approach that bootstraps small-scale LLMs into powerful synthesizers; (ii) multi-checkpoint sampling and multi-aspect scoring for initial data selection, combined with gradient-based influence estimation for final filtering; and (iii) the SCoder family, which demonstrates that this pipeline jointly delivers low cost, openness, and competitive performance.
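The selection stage described above (candidates drawn from multiple synthesizer checkpoints, then filtered by several automatic quality scores) can be sketched as follows. The paper does not specify the scoring dimensions or threshold; the scorers and cutoff here are illustrative placeholders, not the authors' implementation.

```python
from statistics import mean

def multi_aspect_score(sample, scorers):
    """Average several automatic quality scores (e.g. correctness,
    complexity, style) into a single value; the aspects are assumptions."""
    return mean(score(sample) for score in scorers)

def select_initial(pool, scorers, threshold):
    """Keep candidates (pooled from multiple synthesizer checkpoints)
    whose averaged multi-aspect score clears the threshold."""
    return [s for s in pool if multi_aspect_score(s, scorers) >= threshold]

# Toy scorers over string samples -- stand-ins for real automatic judges.
scorers = [
    lambda s: float(len(s) > 10),      # crude "substantiveness" check
    lambda s: float("def " in s),      # crude "contains code" check
]
pool = ["def add(a, b): return a + b", "print(1)"]
selected = select_initial(pool, scorers, threshold=0.75)
```

Only samples that pass all (or most) aspects survive this initial filter; the influence-based step then re-ranks the survivors.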
📝 Abstract
Existing code large language models (LLMs) often rely on large-scale instruction data distilled from proprietary LLMs for fine-tuning, which typically incurs high costs. In this paper, we explore the potential of small-scale open-source LLMs (e.g., 7B) as synthesizers for high-quality code instruction data construction. We first observe that the data synthesis capability of small-scale LLMs can be enhanced by training on a few superior data synthesis samples from proprietary LLMs. Building on this, we propose a novel iterative self-distillation approach to bootstrap small-scale LLMs, transforming them into powerful synthesizers that reduce reliance on proprietary LLMs and minimize costs. Concretely, in each iteration, to obtain diverse and high-quality self-distilled data, we design multi-checkpoint sampling and multi-aspect scoring strategies for initial data selection. Furthermore, to identify the most influential samples, we introduce a gradient-based influence estimation method for final data filtering. Based on the code instruction datasets from the small-scale synthesizers, we develop SCoder, a family of code generation models fine-tuned from DeepSeek-Coder. SCoder models achieve state-of-the-art code generation capabilities, demonstrating the effectiveness of our method.
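The abstract does not give the exact form of the gradient-based influence estimation used for final filtering. A common instantiation (assumed here, in the style of TracIn-like methods) scores each candidate by the inner product of its training gradient with a validation-set gradient, then keeps the top-ranked samples:

```python
import numpy as np

def influence_scores(candidate_grads, val_grad):
    """Score each candidate by the dot product of its (flattened) training
    gradient with a validation gradient; a higher score estimates a larger
    reduction in validation loss if the sample is trained on."""
    return candidate_grads @ val_grad

# Toy example: 5 candidate samples with 8-dimensional stand-in gradients.
rng = np.random.default_rng(0)
cand_grads = rng.normal(size=(5, 8))
val_grad = rng.normal(size=8)

scores = influence_scores(cand_grads, val_grad)
top_k = np.argsort(scores)[::-1][:2]   # keep the 2 most influential samples
```

In practice the gradients would come from the model being fine-tuned (often projected to low dimension for tractability); the dimensions and sample counts above are placeholders.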