🤖 AI Summary
To address the scarcity of high-quality, diverse data for large language model (LLM) training, this paper proposes LinkSyn, a knowledge-graph-based framework that starts from multidisciplinary seed questions and performs guided graph traversal via a knowledge-distribution value function, dynamically balancing knowledge coverage and popularity while allowing controllable adjustment of disciplinary distribution and difficulty. Combining knowledge-point extraction, diffusion-based generation (powered by DeepSeek-R1), and high-difficulty question enhancement, LinkSyn synthesizes logically coherent, cross-disciplinary question-answer pairs. Using this framework, we construct LinkQA, a 50B-token synthetic QA dataset. Continual pretraining of Llama-3 8B on LinkQA yields an average improvement of 11.51% on MMLU and CMMLU, achieving state-of-the-art performance across multiple model scales.
📄 Abstract
The advancement of large language models (LLMs) struggles with the scarcity of high-quality, diverse training data. To address this limitation, we propose LinkSyn, a novel knowledge point (KP) graph-based synthesis framework that enables flexible control over discipline and difficulty distributions while balancing KP coverage and popularity. LinkSyn extracts KPs from question-answering (QA) seed data and constructs a KP graph to synthesize diverse QA data from multiple seeds strongly linked by KPs and sampled from graph walks. Specifically, LinkSyn incorporates (1) a knowledge distribution value function to guide the adjustment of path sampling probability and balance KP coverage and popularity during graph walks; (2) diffusion-based synthesis via DeepSeek-R1 by leveraging multiple seeds with dense logical associations along each path; and (3) high-difficulty QA enhancement within given disciplines via flexible difficulty adjustments. By executing LinkSyn, we synthesize LinkQA, a diverse multi-disciplinary QA dataset with 50B tokens. Extensive experiments on Llama-3 8B demonstrate that continual pre-training with LinkQA yields an average improvement of $\mathbf{11.51\%}$ on MMLU and CMMLU, establishing new SOTA results. LinkQA consistently enhances performance across model sizes and initial FLOPs scales.
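The abstract describes sampling seed groups via graph walks whose transition probabilities are shaped by a value function trading off KP popularity against coverage. The paper does not give the exact formula, so the following is a minimal, hypothetical sketch: edge weight stands in for popularity, and a per-node visit counter penalizes already-covered KPs (`walk_kp_graph`, `alpha`, and the specific value expression are assumptions, not the authors' implementation).

```python
import random
from collections import defaultdict

def walk_kp_graph(graph, start, steps, visit_counts, alpha=0.5):
    """Sample a path of knowledge points (KPs) from a weighted KP graph.

    Hypothetical sketch of a guided graph walk: each neighbor's sampling
    weight combines popularity (edge weight, tempered by alpha) with a
    coverage bonus 1 / (1 + visits) that steers the walk toward KPs not
    yet covered, loosely following the balance LinkSyn describes.
    """
    path = [start]
    visit_counts[start] += 1
    node = start
    for _ in range(steps):
        neighbors = graph.get(node, {})
        if not neighbors:
            break  # dead end: no outgoing edges from this KP
        nodes = list(neighbors)
        weights = [
            (neighbors[n] ** alpha) / (1 + visit_counts[n]) for n in nodes
        ]
        node = random.choices(nodes, weights=weights, k=1)[0]
        path.append(node)
        visit_counts[node] += 1
    return path
```

Each sampled path then serves as a bundle of logically linked seeds for QA synthesis; re-running the walk with the same `visit_counts` naturally spreads later paths toward less-visited KPs.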