🤖 AI Summary
Existing programming reasoning datasets are small (thousands to tens of thousands of problems), limited in difficulty, and rely heavily on manual annotation or simplistic synthetic generation, hindering scalable advancement of large language models (LLMs) on complex reasoning tasks.
Method: We propose QueST, the first framework to integrate difficulty-aware graph sampling with rejection-based fine-tuning, enabling end-to-end generation of high-difficulty programming problems and joint model optimization. The framework supports both chain-of-thought distillation and reinforcement learning.
Contribution/Results: Using QueST, we generate 100K high-quality, challenging problems. Fine-tuning Qwen3-8B-base on this data achieves significant gains on LiveCodeBench over baselines. Further augmenting with 28K human-authored problems and synthetic solutions yields performance competitive with the 671B-parameter DeepSeek-R1. This work breaks the dependency of LLM reasoning capability on manual annotations, establishing a self-sustaining positive feedback loop between data generation and model improvement.
📝 Abstract
Large Language Models have achieved strong performance on reasoning tasks, solving competition-level coding and math problems. However, their scalability is limited by human-labeled datasets and the lack of large-scale, challenging coding problems for training. Existing competitive coding datasets contain only thousands to tens of thousands of problems. Previous synthetic data generation methods rely on either augmenting existing instruction datasets or selecting challenging problems from human-labeled data. In this paper, we propose QueST, a novel framework that combines difficulty-aware graph sampling and difficulty-aware rejection fine-tuning to directly optimize specialized generators for creating challenging coding problems. Our trained generators outperform even GPT-4o at creating challenging problems that benefit downstream performance. We leverage QueST to generate large-scale synthetic coding problems, which we then use either to distill from strong teacher models with long chain-of-thought or to conduct reinforcement learning for smaller models, proving effective in both scenarios. Our distillation experiments demonstrate significant performance gains. Specifically, after fine-tuning Qwen3-8B-base on 100K difficult problems generated by QueST, we surpass the performance of the original Qwen3-8B on LiveCodeBench. With an additional 112K examples (i.e., 28K human-written problems paired with multiple synthetic solutions), our 8B model matches the performance of the much larger DeepSeek-R1-671B. These findings indicate that generating complex problems via QueST offers an effective and scalable approach to advancing the frontiers of competitive coding and reasoning for large language models.
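To make the difficulty-aware rejection idea concrete, the sketch below shows one common way such a filter can be implemented: sample several candidate solutions per generated problem, estimate difficulty as the solutions' failure rate, and keep only problems in a target difficulty band. This is a minimal illustration under assumed interfaces — `solve` (a stand-in for a solver LLM) and `judge` (a stand-in for a unit-test harness) are hypothetical and not the QueST paper's actual API or algorithm.

```python
from typing import Callable, List

def estimate_difficulty(problem: str,
                        solve: Callable[[str], str],
                        judge: Callable[[str, str], bool],
                        k: int = 8) -> float:
    """Estimate difficulty as the failure rate over k sampled solutions.

    `solve` and `judge` are hypothetical stand-ins for a solver model
    and a test harness; any real system would call an LLM and run
    unit tests here.
    """
    passes = sum(judge(problem, solve(problem)) for _ in range(k))
    return 1.0 - passes / k

def rejection_filter(problems: List[str],
                     solve: Callable[[str], str],
                     judge: Callable[[str, str], bool],
                     min_difficulty: float = 0.5,
                     max_difficulty: float = 1.0) -> List[str]:
    """Keep only problems whose estimated difficulty falls in a target
    band: hard enough to be informative for training, but within the
    range the pipeline can still verify."""
    return [p for p in problems
            if min_difficulty
            <= estimate_difficulty(p, solve, judge)
            <= max_difficulty]
```

In a full pipeline, the surviving hard problems would then feed chain-of-thought distillation or reinforcement learning, closing the loop between problem generation and model improvement.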