🤖 AI Summary
The scarcity of high-quality, knowledge-intensive training data constrains large language model (LLM) development—particularly in STEM domains and on challenging tasks. To address this, we propose BoostQA: a framework that synthesizes diverse, high-difficulty question-answer (QA) pairs via multi-source seed sampling, hierarchical STEM question generation, and difficulty-enhancement strategies. BoostQA introduces a novel discipline-difficulty joint annotation scheme and a mid-training paradigm, leveraging DeepSeek-R1 for question generation and DeepSeek-V3 for answer refinement to construct a 100B-token high-fidelity QA dataset spanning cross-domain, multi-disciplinary, and multi-difficulty scenarios. Applying 40B-token mid-training on Llama-3 8B yields average improvements of 12.74% on MMLU and CMMLU, achieving state-of-the-art performance on 12 benchmarks. Crucially, gains scale consistently with model size, data volume, and compute budget.
📝 Abstract
The scarcity of high-quality, knowledge-intensive training data hinders the development of large language models (LLMs), as traditional corpora provide limited information. Previous studies have synthesized and integrated corpus-dependent question-answering (QA) data to improve model performance, but face challenges in QA data scalability and knowledge diversity, particularly in cross-domain contexts. Furthermore, leveraging our discipline and difficulty annotation system, we identify model deficiencies in STEM disciplines and on high-difficulty data. To overcome these limitations, we propose a novel diversified pipeline to synthesize BoostQA, a 100B-token large-scale QA dataset. Our synthesis framework: (1) curates seed data from heterogeneous sources; (2) utilizes DeepSeek-R1 to perform STEM-focused multi-grade synthesis, boosting data diversity, and high-difficulty synthesis, mitigating difficulty degradation; (3) refines answers via DeepSeek-V3 to improve output quality. We apply BoostQA in mid-training, an intermediate stage between pre-training and post-training, to optimize domain-specific knowledge acquisition and enhance data quality. Our method enables Llama-3 8B, mid-trained on a 40B-token dataset, to achieve an average improvement of $\mathbf{12.74\%}$ on MMLU and CMMLU and to establish SOTA average performance across 12 benchmarks. BoostQA also demonstrates robust scalability, with performance improving consistently as model size, data volume, and initial FLOPs scale.
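The three-stage framework in the abstract can be sketched as a simple orchestration loop. This is a minimal illustration, not the paper's implementation: `synthesize_question` and `refine_answer` are hypothetical stand-ins for the DeepSeek-R1 question-synthesis and DeepSeek-V3 answer-refinement calls, and the grade labels are placeholders for the paper's multi-grade scheme.

```python
import random

# Hypothetical stand-ins for the paper's LLM calls (DeepSeek-R1 for
# question synthesis, DeepSeek-V3 for answer refinement); simple
# string templates keep the sketch self-contained and runnable.
def synthesize_question(seed: str, grade: str) -> str:
    return f"[{grade}] question derived from seed: {seed}"

def refine_answer(question: str) -> str:
    return f"refined answer for: {question}"

def boostqa_pipeline(seed_corpora, grades, n_grades_per_seed=1, rng=None):
    """Sketch of the three stages: (1) multi-source seed sampling,
    (2) STEM-focused multi-grade question synthesis, (3) answer refinement."""
    rng = rng or random.Random(0)
    # (1) curate seed data from heterogeneous sources
    seeds = [s for corpus in seed_corpora for s in corpus]
    qa_pairs = []
    for seed in seeds:
        # (2) multi-grade synthesis to boost diversity and difficulty
        for grade in rng.sample(grades, k=min(n_grades_per_seed, len(grades))):
            question = synthesize_question(seed, grade)
            # (3) refine the answer to improve output quality
            answer = refine_answer(question)
            qa_pairs.append({"question": question, "answer": answer, "grade": grade})
    return qa_pairs

pairs = boostqa_pipeline(
    seed_corpora=[["photosynthesis passage"], ["calculus passage"]],
    grades=["middle-school", "undergraduate", "olympiad"],
    n_grades_per_seed=2,
)
print(len(pairs))  # 2 seeds × 2 sampled grades = 4 QA pairs
```

In the actual pipeline each stage would also carry the discipline-difficulty joint annotations, which the QA records above only hint at via the `grade` field.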