Large-Scale Diverse Synthesis for Mid-Training

📅 2025-08-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
The scarcity of high-quality, knowledge-intensive training data constrains large language model (LLM) development, particularly in STEM domains and on challenging tasks. To address this, we propose BoostQA: a framework that synthesizes diverse, high-difficulty question-answer (QA) pairs via multi-source seed sampling, hierarchical STEM question generation, and difficulty-enhancement strategies. BoostQA introduces a discipline-difficulty joint annotation scheme and a mid-training paradigm, leveraging DeepSeek-R1 for question generation and DeepSeek-V3 for answer refinement to construct a 100B-token high-fidelity QA dataset spanning cross-domain, multi-disciplinary, and multi-difficulty scenarios. Mid-training Llama-3 8B on a 40B-token subset yields an average improvement of 12.74% on MMLU and CMMLU and state-of-the-art average performance across 12 benchmarks. Crucially, gains scale consistently with model size, data volume, and compute budget.

📝 Abstract
The scarcity of high-quality, knowledge-intensive training data hinders the development of large language models (LLMs), as traditional corpora provide limited information. Previous studies have synthesized and integrated corpora-dependent question-answering (QA) data to improve model performance but face challenges in QA data scalability and knowledge diversity, particularly in cross-domain contexts. Furthermore, leveraging our designed discipline and difficulty annotation system, we probe model deficiencies in STEM disciplines and high-difficulty data. To overcome these limitations, we propose a novel diversified pipeline to synthesize BoostQA, a 100B-token large-scale QA dataset. Our synthesis framework: (1) curates seed data from heterogeneous sources; (2) utilizes DeepSeek-R1 to implement STEM-focused multi-grade synthesis to boost data diversity and high-difficulty synthesis to mitigate difficulty degradation; (3) refines answers via DeepSeek-V3 to improve output quality. We utilize BoostQA in mid-training, a mid-stage between pre-training and post-training, to optimize domain-specific knowledge acquisition and enhance data quality. Our method enables Llama-3 8B, mid-trained on a 40B-token dataset, to achieve an average improvement of 12.74% on MMLU and CMMLU and establish SOTA average performance across 12 benchmarks. BoostQA also demonstrates robust scalability, with performance consistently improving as model size, data volume, and initial FLOPs scale.
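The three-stage pipeline in the abstract (seed curation, multi-grade question synthesis, answer refinement) can be sketched as a simple loop. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, and the generator/refiner stand in for DeepSeek-R1 and DeepSeek-V3, which the paper calls through for real synthesis.

```python
from dataclasses import dataclass
import random

@dataclass
class QAPair:
    question: str
    answer: str
    discipline: str   # e.g. "biology"
    difficulty: int   # here, the grade level used at synthesis time

# Trivial mocks standing in for the paper's generator (DeepSeek-R1)
# and refiner (DeepSeek-V3), so the sketch runs without API access.
def generate_question(seed: str, discipline: str, grade: int) -> str:
    return f"[{discipline}/grade {grade}] Question derived from: {seed}"

def refine_answer(question: str, draft: str) -> str:
    return draft.strip().capitalize()

def synthesize(seeds: list[str], disciplines: list[str], grades: range) -> list[QAPair]:
    pairs = []
    for seed in seeds:
        # (1) heterogeneous seed curation: each seed is mapped to a discipline
        discipline = random.choice(disciplines)
        # (2) multi-grade synthesis: one question per grade level, for diversity
        for grade in grades:
            question = generate_question(seed, discipline, grade)
            draft = f"draft answer to {question!r}"
            # (3) answer refinement pass to improve output quality
            pairs.append(QAPair(question, refine_answer(question, draft), discipline, grade))
    return pairs
```

For example, `synthesize(["photosynthesis"], ["biology"], range(1, 4))` yields three QA pairs for one seed, one per grade level.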
Problem

Research questions and friction points this paper is trying to address.

Addresses scarcity of high-quality knowledge-intensive training data for LLMs
Improves QA data scalability and cross-domain knowledge diversity
Mitigates model deficiencies in STEM disciplines and high-difficulty data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Heterogeneous seed data curation for diversity
STEM-focused multi-grade synthesis for difficulty
DeepSeek-V3 answer refinement for quality