🤖 AI Summary
Existing large language models (LLMs) for code rely heavily on proprietary models to generate large-scale instruction data, incurring high computational and financial costs. To address this, the paper proposes an iterative self-distillation framework built on small-scale open-source LLMs (e.g., 7B-parameter models), which synthesize high-quality code instruction data via multi-checkpoint sampling, multi-aspect automatic scoring, and gradient-based influence estimation. This approach reduces dependence on closed-source models and substantially lowers data construction costs. The resulting SCoder models, fine-tuned from DeepSeek-Coder on the synthesized data, achieve state-of-the-art code generation performance, showing that compact open models can serve as effective instruction-data synthesizers. The core contributions are threefold: (i) an iterative self-distillation approach that bootstraps small-scale LLMs into powerful synthesizers; (ii) multi-checkpoint sampling and multi-aspect scoring for initial data selection, combined with gradient-based influence estimation for final filtering; and (iii) the SCoder family, which demonstrates that this pipeline jointly delivers low cost, openness, and competitive performance.
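The selection stage described above (candidates drawn from multiple synthesizer checkpoints, then filtered by several automatic quality scores) can be sketched as follows. The paper does not specify the scoring dimensions or threshold; the scorers and cutoff here are illustrative placeholders, not the authors' implementation.

```python
from statistics import mean

def multi_aspect_score(sample, scorers):
    """Average several automatic quality scores (e.g. correctness,
    complexity, style) into a single value; the aspects are assumptions."""
    return mean(score(sample) for score in scorers)

def select_initial(pool, scorers, threshold):
    """Keep candidates (pooled from multiple synthesizer checkpoints)
    whose averaged multi-aspect score clears the threshold."""
    return [s for s in pool if multi_aspect_score(s, scorers) >= threshold]

# Toy scorers over string samples -- stand-ins for real automatic judges.
scorers = [
    lambda s: float(len(s) > 10),      # crude "substantiveness" check
    lambda s: float("def " in s),      # crude "contains code" check
]
pool = ["def add(a, b): return a + b", "print(1)"]
selected = select_initial(pool, scorers, threshold=0.75)
```

Only samples that pass all (or most) aspects survive this initial filter; the influence-based step then re-ranks the survivors.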
📝 Abstract
Existing code large language models (LLMs) often rely on large-scale instruction data distilled from proprietary LLMs for fine-tuning, which typically incurs high costs. In this paper, we explore the potential of small-scale open-source LLMs (e.g., 7B) as synthesizers for high-quality code instruction data construction. We first observe that the data synthesis capability of small-scale LLMs can be enhanced by training on a few superior data synthesis samples from proprietary LLMs. Building on this, we propose a novel iterative self-distillation approach to bootstrap small-scale LLMs, transforming them into powerful synthesizers that reduce reliance on proprietary LLMs and minimize costs. Concretely, in each iteration, to obtain diverse and high-quality self-distilled data, we design multi-checkpoint sampling and multi-aspect scoring strategies for initial data selection. Furthermore, to identify the most influential samples, we introduce a gradient-based influence estimation method for final data filtering. Based on the code instruction datasets from the small-scale synthesizers, we develop SCoder, a family of code generation models fine-tuned from DeepSeek-Coder. SCoder models achieve state-of-the-art code generation capabilities, demonstrating the effectiveness of our method.
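The abstract does not give the exact form of the gradient-based influence estimation used for final filtering. A common instantiation (assumed here, in the style of TracIn-like methods) scores each candidate by the inner product of its training gradient with a validation-set gradient, then keeps the top-ranked samples:

```python
import numpy as np

def influence_scores(candidate_grads, val_grad):
    """Score each candidate by the dot product of its (flattened) training
    gradient with a validation gradient; a higher score estimates a larger
    reduction in validation loss if the sample is trained on."""
    return candidate_grads @ val_grad

# Toy example: 5 candidate samples with 8-dimensional stand-in gradients.
rng = np.random.default_rng(0)
cand_grads = rng.normal(size=(5, 8))
val_grad = rng.normal(size=8)

scores = influence_scores(cand_grads, val_grad)
top_k = np.argsort(scores)[::-1][:2]   # keep the 2 most influential samples
```

In practice the gradients would come from the model being fine-tuned (often projected to low dimension for tractability); the dimensions and sample counts above are placeholders.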