Synthesize-on-Graph: Knowledgeable Synthetic Data Generation for Continue Pre-training of Large Language Models

📅 2025-05-02
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address data scarcity and inefficiency in continual pre-training of large language models (LLMs) on small-scale domain-specific corpora, this paper proposes a synthetic data generation method grounded in cross-document knowledge graphs. Unlike existing approaches that model only intra-document content, the method explicitly constructs a knowledge graph capturing inter-document entity and concept associations, then employs graph traversal to sample knowledge-connected contexts from which synthetic texts with high lexical diversity are generated. It further integrates chain-of-thought reasoning with contrastive clarification to strengthen multi-step reasoning and discriminative accuracy. Experiments demonstrate that the approach significantly outperforms state-of-the-art methods on multi-hop question answering, achieves comparable performance on reading comprehension, and is better at modeling rare knowledge and generalizing across domains.

πŸ“ Abstract
Large Language Models (LLMs) have achieved remarkable success but remain data-inefficient, especially when learning from small, specialized corpora with limited and proprietary data. Existing synthetic data generation methods for continual pre-training focus on intra-document content and overlook cross-document knowledge associations, limiting content diversity and depth. We propose Synthetic-on-Graph (SoG), a synthetic data generation framework that incorporates cross-document knowledge associations for efficient corpus expansion. SoG constructs a context graph by extracting entities and concepts from the original corpus, representing cross-document associations, and employing a graph walk strategy for knowledge-associated sampling. This enhances synthetic data diversity and coherence, enabling models to learn complex knowledge structures and handle rare knowledge. To further improve synthetic data quality, we integrate Chain-of-Thought (CoT) and Contrastive Clarifying (CC) synthesis strategies, enhancing reasoning processes and discriminative power. Experiments show that SoG outperforms the state-of-the-art (SOTA) method on a multi-hop document Q&A dataset while performing comparably to the SOTA method on reading comprehension datasets, which also underscores the better generalization capability of SoG. Our work advances synthetic data generation and provides practical solutions for efficient knowledge acquisition in LLMs, especially in domains with limited data availability.
Problem

Research questions and friction points this paper is trying to address.

Enhancing synthetic data diversity via cross-document knowledge associations
Improving LLMs' ability to learn complex knowledge structures
Addressing data inefficiency in specialized corpora for pre-training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Graph-based synthetic data generation with cross-document associations
Integrates Chain-of-Thought and Contrastive Clarifying techniques
Enhances data diversity and coherence via knowledge-aware sampling
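The core idea above (a cross-document context graph plus a graph walk for knowledge-associated sampling) can be sketched minimally as follows. This is an illustrative approximation, not the paper's implementation: the function names, the entity co-occurrence linking rule, and the plain random walk are all assumptions made for the sketch.

```python
import random
from collections import defaultdict

def build_context_graph(docs_entities):
    """Sketch of context-graph construction: link entities that co-occur
    in the same document. Entities appearing in multiple documents then
    act as bridges, encoding cross-document associations.
    docs_entities: dict mapping doc id -> list of extracted entities/concepts.
    """
    graph = defaultdict(set)
    for entities in docs_entities.values():
        for a in entities:
            for b in entities:
                if a != b:
                    graph[a].add(b)
    return graph

def graph_walk_sample(graph, start, length, rng=random):
    """Sketch of knowledge-associated sampling: a random walk over the
    context graph yields a set of related entities (possibly spanning
    several documents) to condition synthetic text generation on.
    """
    path = [start]
    for _ in range(length - 1):
        neighbors = sorted(graph[path[-1]])  # sorted for reproducibility
        if not neighbors:
            break
        path.append(rng.choice(neighbors))
    return path
```

In this simplification, entities sampled along one walk would seed a single synthesis prompt; because the walk can cross document boundaries via shared entities, the generated text mixes knowledge from multiple source documents rather than paraphrasing one.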