🤖 AI Summary
This work addresses the scalability limitations of large code models in competitive programming, which often stem from their reliance on finite real-world data. To overcome this, we propose SynthSmith, the first framework to train competition-level code models exclusively on synthetic data. Our approach employs a feature-driven synthesis strategy to automatically generate programming tasks, reference solutions, and test cases, followed by supervised fine-tuning and code-centric reinforcement learning. A systematic ablation study validates these design choices. The resulting model, X-Coder-7B, achieves pass rates of 62.9% (avg@8) on LiveCodeBench v5 and 55.8% on v6, outperforming several 14B-parameter models. This study provides the first empirical evidence that purely synthetic data can effectively support complex code reasoning, demonstrating both the feasibility and the scalability of training high-performance code generation models without real-world data.
📝 Abstract
Competitive programming poses a significant challenge for Code LLMs. While recent models have shown promise, they rely heavily on finite real-world data, raising concerns about scalability and contamination. In this paper, we investigate a critical question: can we elevate models to expert-level reasoning performance using fully synthetic data? We first observe that off-the-shelf synthesis methods yield suboptimal results in this domain. To address this, we systematically investigate the key factors governing synthetic data quality. Leveraging these findings, we significantly advance the feature-based synthesis paradigm via domain-specific evolution and a dual-verification strategy, promoting task solvability, solution correctness, and test accuracy. Using this high-quality synthetic data, we train the X-Coder model series under an SFT-then-RL paradigm. X-Coder-7B shows significant performance gains on the challenging LiveCodeBench v5 (62.9% avg@8) and v6 (55.8% avg@8), outperforming larger models trained on real-world data. Extensive analysis distills valuable insights into synthetic data scaling, the necessity of domain-adapted feature evolution, and code-centric reinforcement learning.