🤖 AI Summary
The construction of high-quality training data for Code Large Language Models (CodeLLMs) is hindered by the lack of systematic frameworks and reproducible standards for synthetic data generation and filtering.
Method: We propose the first taxonomy of data synthesis techniques specifically designed for CodeLLMs, unifying paradigms including LLM self-generation, execution-based filtering, syntactic/semantic constraint augmentation, multi-stage knowledge distillation, and quality scoring—accompanied by rigorously defined evaluation dimensions and practical guidelines.
Contribution/Results: Our framework explicitly characterizes the fundamental trade-offs among data noise, semantic fidelity, and computational efficiency, and delivers a reusable technical roadmap with actionable implementation guidelines, substantially lowering the barrier to producing high-quality synthetic code data.
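Among the paradigms listed above, execution-based filtering is the most mechanical: a synthetic sample is kept only if its code actually runs against its accompanying tests. A minimal sketch of that idea, assuming each sample is a (solution, tests) source-code pair (the function names here are illustrative, not from the paper):

```python
def passes_tests(candidate_src: str, test_src: str) -> bool:
    """Execute a candidate solution together with its tests.

    A synthetic sample survives filtering only if defining the candidate
    and running its asserts raises no exception. (Real pipelines would
    sandbox this with timeouts and process isolation.)
    """
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)  # define the candidate function(s)
        exec(test_src, namespace)       # run the accompanying asserts
        return True
    except Exception:
        return False


def execution_filter(samples):
    """Keep only (solution_source, test_source) pairs whose tests pass."""
    return [s for s in samples if passes_tests(*s)]
```

For example, given one correct and one buggy implementation of `add`, only the correct pair survives:

```python
samples = [
    ("def add(a, b): return a + b", "assert add(1, 2) == 3"),  # correct
    ("def add(a, b): return a - b", "assert add(1, 2) == 3"),  # buggy
]
kept = execution_filter(samples)  # keeps only the first pair
```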
📝 Abstract
Large language models (LLMs) have shown impressive performance in *code* understanding and generation, making coding tasks a key focus for researchers due to their practical applications and value as a testbed for LLM evaluation. Data synthesis and filtering techniques have been widely adopted and shown to be highly effective in this context. In this paper, we present a focused survey and taxonomy of these techniques, emphasizing recent advancements. We highlight key challenges, explore future research directions, and offer practical guidance for new researchers entering the field.