PromptCoT 2.0: Scaling Prompt Synthesis for Large Language Model Reasoning

📅 2025-09-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: High-quality training data for reasoning is scarce, and existing synthetic approaches produce problems that are too easy and too narrow. Method: This paper proposes PromptCoT 2.0, an automated prompt synthesis framework built on an expectation-maximization (EM) loop that replaces hand-crafted heuristics. The loop iteratively refines rationales that guide prompt construction, yielding math and programming problems that are harder and more diverse than prior corpora. The synthetic prompts support two post-training regimes: self-play, where a strong model improves from verifiable feedback without a stronger teacher, and supervised fine-tuning (SFT), where a weaker model learns from teacher-distilled traces. Results: In self-play, Qwen3-30B-A3B-Thinking-2507 sets new state-of-the-art results at the 30B scale on AIME 24/25, HMMT 25, and LiveCodeBench v5/v6, and gains 35 Elo on Codeforces. In SFT, Qwen2.5-7B-Instruct trained solely on synthetic prompts surpasses models trained on human or hybrid data. Contribution: A scalable, heuristic-free, EM-driven mechanism for synthesizing reasoning problems, which positions prompt synthesis as a new axis for scaling reasoning.

📝 Abstract

Large language models (LLMs) are evolving from conversational systems into strong reasoners for tasks such as Olympiad mathematics and competitive programming. While scaling parameters and test-time computation has driven progress, a key bottleneck is the lack of high-quality training problems: human-curated datasets are costly and limited, while existing synthetic corpora are often too easy or narrow. PromptCoT 1.0 showed that injecting rationales into prompt synthesis increases problem difficulty. Building on this, we present PromptCoT 2.0, a scalable framework that replaces hand-crafted heuristics with an expectation-maximization (EM) loop, where rationales are iteratively refined to guide prompt construction. This produces problems that are both harder and more diverse than prior corpora. The synthetic prompts support two post-training regimes: (1) Self-Play, where strong models improve autonomously via verifiable feedback without stronger teachers; and (2) Supervised Fine-Tuning (SFT), where weaker models learn from teacher-distilled traces. Extensive experiments demonstrate the effectiveness of this approach. In self-play, applying PromptCoT 2.0 to Qwen3-30B-A3B-Thinking-2507 sets new state-of-the-art results at the 30B scale, with +4.4, +4.8, and +5.3 on AIME 24/25 and HMMT 25, +6.1 and +5.0 on LiveCodeBench v5/v6, and +35 Elo on Codeforces. In SFT, training Qwen2.5-7B-Instruct solely on synthetic prompts boosts accuracy to 73.1 (AIME 24), 65.6 (AIME 25), and 53.4 (LiveCodeBench v5), surpassing models trained on human or hybrid data. Analyses further confirm that PromptCoT 2.0 yields fundamentally harder and distributionally distinct problems. These results establish prompt synthesis as a new axis for scaling reasoning and position PromptCoT 2.0 as a scalable foundation for future open-source models. The implementation is available at https://github.com/inclusionAI/PromptCoT.
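The EM loop mentioned in the abstract can be read as a standard latent-variable bound. The following is only a sketch under assumed notation, not the paper's stated objective: c stands for the input that conditions synthesis, z for the latent rationale, x for the synthesized prompt, p_θ for the prompt generator, and q_φ for a rationale proposal model. None of these symbols appear in this summary.

```latex
% Assumed notation: c = conditioning input, z = rationale, x = prompt.
% Evidence lower bound on the log-likelihood of a prompt:
\log p_\theta(x \mid c)
  = \log \sum_{z} p_\theta(z, x \mid c)
  \;\ge\; \mathbb{E}_{q_\phi(z \mid c, x)}
      \big[ \log p_\theta(z, x \mid c) - \log q_\phi(z \mid c, x) \big]

% E-step (theta fixed): move q_phi toward the posterior over rationales.
q_\phi(z \mid c, x) \;\propto\; p_\theta(z, x \mid c)

% M-step (phi fixed): refit the generator on rationales drawn from q_phi.
\theta \leftarrow \arg\max_{\theta}\;
  \mathbb{E}_{q_\phi(z \mid c, x)} \big[ \log p_\theta(z, x \mid c) \big]
```

Under this reading, "iteratively refined rationales" corresponds to alternating the two steps: better rationales tighten the bound, and a better generator in turn scores rationales more accurately.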

Problem

Research questions and friction points this paper is trying to address.

Generating high-quality training problems for large language model reasoning tasks

Overcoming limitations of costly human-curated datasets and narrow synthetic corpora

Creating harder and more diverse problems through iterative rationale refinement

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses an EM loop to iteratively refine the rationales that guide prompt construction

Generates harder and more diverse synthetic training problems

Supports two post-training regimes: self-play for strong models and SFT for weaker ones
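The self-play regime relies on verifiable feedback rather than a stronger teacher. A minimal Python sketch of how such feedback might be turned into training pairs is shown below. Everything in it is hypothetical: the `Sample` type, the exact-match `verify` function, and `build_preference_pair` are illustrative names, and the reference answer is assumed to be available from some automatic check. It is not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Sample:
    """One model response to a synthetic prompt (hypothetical structure)."""
    prompt: str
    response: str
    final_answer: str


def verify(sample: Sample, reference: str) -> bool:
    """Assumed verifier: exact match on the whitespace-stripped final answer."""
    return sample.final_answer.strip() == reference.strip()


def build_preference_pair(
    samples: List[Sample],
    reference: str,
    verifier: Callable[[Sample, str], bool] = verify,
) -> Optional[Tuple[Sample, Sample]]:
    """Return (chosen, rejected) if the samples disagree on correctness.

    Prompts where every sample is right, or every sample is wrong, carry
    no contrast and are skipped by returning None.
    """
    correct = [s for s in samples if verifier(s, reference)]
    wrong = [s for s in samples if not verifier(s, reference)]
    if not correct or not wrong:
        return None
    return correct[0], wrong[0]


if __name__ == "__main__":
    prompt = "Compute 6 * 7."
    samples = [
        Sample(prompt, "6 * 7 = 41", "41"),
        Sample(prompt, "6 * 7 = 42", " 42 "),
    ]
    pair = build_preference_pair(samples, reference="42")
    if pair is not None:
        chosen, rejected = pair
        print("chosen:", chosen.response)
        print("rejected:", rejected.response)
```

Skipping prompts without a correct/incorrect contrast is one plausible design choice: such prompts are either too easy or too hard for the current model to produce a useful learning signal.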
