Seed-Coder: Let the Code Model Curate Data for Itself

📅 2025-06-04

📈 Citations: 2

✨ Influential: 0

🤖 AI Summary

Open-source large language models (LLMs) for code often depend on hand-crafted, language-specific filtering rules and human annotations to build their pretraining data. This approach scales poorly, is costly to extend and maintain across programming languages, and is prone to subjective bias. Seed-Coder addresses these limitations with a model-centric data pipeline that relies predominantly on LLMs to score and filter code data, minimizing human involvement. The series consists of 8B-parameter base, instruct, and reasoning models: the instruct model is trained with supervised fine-tuning (SFT) and preference optimization, and the reasoning model uses Long-Chain-of-Thought (LongCoT) reinforcement learning to strengthen multi-step code reasoning. The models achieve state-of-the-art results among open-source models of similar size across code generation, completion, editing, reasoning, and software engineering tasks, and surpass some much larger models.

📝 Abstract

Code data in large language model (LLM) pretraining is recognized as crucial not only for code-related tasks but also for enhancing the general intelligence of LLMs. Current open-source LLMs often rely heavily on human effort to produce their code pretraining data, such as employing hand-crafted filtering rules tailored to individual programming languages, or using human-annotated data to train quality filters. However, these approaches are inherently limited in scalability, prone to subjective biases, and costly to extend and maintain across diverse programming languages. To address these challenges, we introduce Seed-Coder, a series of open-source LLMs comprising base, instruct, and reasoning models of 8B size, minimizing human involvement in data construction. Our code pretraining data is produced by a model-centric data pipeline, which predominantly leverages LLMs for scoring and filtering code data. The instruct model is further trained via supervised fine-tuning and preference optimization, and the reasoning model leverages Long-Chain-of-Thought (LongCoT) reinforcement learning to improve multi-step code reasoning. Seed-Coder achieves state-of-the-art results among open-source models of similar size and even surpasses some much larger models, demonstrating superior performance in code generation, code completion, code editing, code reasoning, and software engineering tasks.

Problem

Research questions and friction points this paper is trying to address.

Minimizing human effort in code pretraining data creation

Overcoming scalability and bias in code data filtering

Enhancing multi-step code reasoning in LLMs

Innovation

Methods, ideas, or system contributions that make the work stand out.

Model-centric pipeline for code data scoring

LongCoT reinforcement learning for reasoning

Minimized human involvement in data construction
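
The model-centric scoring and filtering idea listed above can be illustrated with a minimal Python sketch. This is not the paper's pipeline: the names `CodeFile`, `filter_by_model_score`, and `toy_score`, and the threshold value, are all hypothetical. In a real setup, `score_fn` would wrap a call to an LLM-based quality scorer; here `toy_score` is only a placeholder so the example runs without a model.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class CodeFile:
    """A candidate pretraining document (hypothetical container)."""
    path: str
    text: str


def filter_by_model_score(
    files: Iterable[CodeFile],
    score_fn: Callable[[str], float],
    threshold: float = 0.5,
) -> List[CodeFile]:
    """Keep files whose score from `score_fn` meets the threshold.

    `score_fn` stands in for a model-based quality scorer; the
    filtering logic contains no language-specific rules of its own.
    """
    return [f for f in files if score_fn(f.text) >= threshold]


def toy_score(text: str) -> float:
    """Placeholder scorer, NOT an LLM: fraction of non-blank lines."""
    lines = text.splitlines()
    if not lines:
        return 0.0
    non_blank = sum(1 for line in lines if line.strip())
    return non_blank / len(lines)


corpus = [
    CodeFile("a.py", "def f():\n    return 1\n"),
    CodeFile("b.py", "\n\n\nx=1\n"),
]
kept = filter_by_model_score(corpus, toy_score, threshold=0.5)
print([f.path for f in kept])
```

Because the scorer is passed in as a callable, the same filtering code applies to any programming language; only the scoring model would need to change, which is the scalability argument the summary makes.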
