LinkQA: Synthesizing Diverse QA from Multiple Seeds Strongly Linked by Knowledge Points

πŸ“… 2025-08-02
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
To address the scarcity of high-quality, diverse data in large language model (LLM) training, this paper proposes LinkSyn: a knowledge-graph-based framework that initiates from multidisciplinary seed questions and performs guided graph traversal via a knowledge-distribution value function, dynamically balancing knowledge coverage and popularity while enabling controllable adjustment of disciplinary distribution and difficulty levels. Leveraging knowledge-point extraction, diffusion-based generation (powered by DeepSeek-R1), and high-difficulty question enhancement, LinkSyn synthesizes logically coherent, cross-disciplinary question-answer pairs. Using this framework, we construct LinkQAβ€”a 50B-token synthetic dataset. Continual pretraining on Llama-3 8B yields average improvements of 11.51% on MMLU and CMMLU, achieving state-of-the-art performance across multiple model scales.

πŸ“ Abstract
The advancement of large language models (LLMs) struggles with the scarcity of high-quality, diverse training data. To address this limitation, we propose LinkSyn, a novel knowledge point (KP) graph-based synthesis framework that enables flexible control over discipline and difficulty distributions while balancing KP coverage and popularity. LinkSyn extracts KPs from question-answering (QA) seed data and constructs a KP graph to synthesize diverse QA data from multiple seeds strongly linked by KPs and sampled from graph walks. Specifically, LinkSyn incorporates (1) a knowledge distribution value function to guide the adjustment of path sampling probability and balance KP coverage and popularity during graph walks; (2) diffusion-based synthesis via DeepSeek-R1 by leveraging multiple seeds with dense logical associations along each path; and (3) high-difficulty QA enhancement within given disciplines by flexible difficulty adjustments. By executing LinkSyn, we synthesize LinkQA, a diverse multi-disciplinary QA dataset with 50B tokens. Extensive experiments on Llama-3 8B demonstrate that continual pre-training with LinkQA yields an average improvement of **11.51%** on MMLU and CMMLU, establishing new SOTA results. LinkQA consistently enhances performance across model size and initial FLOPs scales.
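The value-function-guided graph walk described above can be sketched in a few lines. The graph, popularity scores, and the exact form of the value function below are illustrative assumptions, not the authors' implementation: the key idea is that a KP's sampling weight rises with its popularity and falls with how often the walk has already visited it, steering paths toward under-covered knowledge points.

```python
import math
import random
from collections import defaultdict

# Toy KP graph (assumed): nodes are knowledge points; edges link KPs
# that co-occur in seed QA items.
GRAPH = {
    "algebra":     ["calculus", "probability"],
    "calculus":    ["algebra", "physics"],
    "probability": ["algebra", "statistics"],
    "statistics":  ["probability", "physics"],
    "physics":     ["calculus", "statistics"],
}
# Assumed popularity: e.g. each KP's frequency among seed questions.
POPULARITY = {
    "algebra": 0.9, "calculus": 0.7, "probability": 0.6,
    "statistics": 0.4, "physics": 0.5,
}

def value(kp, visits, alpha=1.0, beta=0.5):
    # Hypothetical value function: popularity pulls the walk toward
    # well-attested KPs, while the visit penalty pushes it toward
    # under-covered ones (the coverage/popularity balance).
    return alpha * POPULARITY[kp] - beta * visits[kp]

def sample_path(start, length, seed=0):
    rng = random.Random(seed)
    visits = defaultdict(int)
    path = [start]
    visits[start] += 1
    for _ in range(length - 1):
        nbrs = GRAPH[path[-1]]
        # Softmax-style weights over neighbors from the value function.
        weights = [math.exp(value(n, visits)) for n in nbrs]
        nxt = rng.choices(nbrs, weights=weights, k=1)[0]
        path.append(nxt)
        visits[nxt] += 1
    return path

# Each sampled path yields a set of strongly linked seed QAs that the
# synthesis model (DeepSeek-R1 in the paper) would fuse into new QA pairs.
path = sample_path("algebra", length=4)
print(path)
```

In the paper's framing, discipline and difficulty control would enter by restricting the graph to a discipline's subgraph and by conditioning the synthesis prompt, which this sketch omits.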
Problem

Research questions and friction points this paper is trying to address.

Addressing scarcity of high-quality diverse training data for LLMs
Synthesizing QA data with controlled discipline and difficulty
Enhancing model performance via knowledge point linked QA generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

KP graph-based synthesis for diverse QA data
Diffusion-based synthesis using DeepSeek-R1 model
Flexible difficulty adjustments for high-difficulty QA
πŸ”Ž Similar Papers
No similar papers found.