A Graph-Based Synthetic Data Pipeline for Scaling High-Quality Reasoning Instructions

๐Ÿ“… 2024-12-12
๐Ÿ›๏ธ arXiv.org
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
High-quality reasoning data synthesis suffers from prohibitive costs and poor scalability. Method: We propose the first knowledge graph–based reasoning data synthesis framework: (1) extract knowledge points from seed data to construct a structured knowledge-point relationship graph; (2) generate multi-hop reasoning paths via graph traversal; and (3) leverage open-source LLMs for instruction synthesis and quality filtering. Contribution/Results: Our method achieves 255× data expansion, with synthesis quality on par with GPT-4-0613 at 100× lower cost. We release GSDP-MATH, a large-scale synthetic mathematical reasoning dataset of 1.91M problem–answer pairs. Fine-tuning Mistral-7B on GSDP-MATH yields GSDP-7B, which reaches 37.7% accuracy on MATH and 78.4% on GSM8K, advancing open-model capabilities in mathematical reasoning.
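The graph-traversal step (2) can be sketched as simple-path enumeration over a knowledge-point graph: each multi-hop path combines related concepts into a seed for instruction synthesis. This is a minimal illustrative sketch, not the paper's implementation; the node names and edge structure below are invented for the example.

```python
# Toy knowledge-point graph: each key maps to related knowledge points.
# (Illustrative only; GSDP builds this graph from seed data.)
GRAPH = {
    "fractions": ["ratios", "percentages"],
    "ratios": ["proportions"],
    "percentages": ["interest"],
    "proportions": [],
    "interest": [],
}

def multi_hop_paths(graph, start, max_hops):
    """Depth-first enumeration of simple paths with at most max_hops edges."""
    paths = []

    def dfs(node, path):
        if len(path) > 1:
            paths.append(tuple(path))    # every multi-node prefix is a path
        if len(path) > max_hops:         # path already has max_hops edges
            return
        for nbr in graph.get(node, []):
            if nbr not in path:          # keep paths simple (no cycles)
                dfs(nbr, path + [nbr])

    dfs(start, [start])
    return paths

# Each path is a concept combination a synthesis LLM could be prompted with,
# e.g. "write a problem that requires fractions -> ratios -> proportions".
for p in multi_hop_paths(GRAPH, "fractions", max_hops=2):
    print(" -> ".join(p))
```

Enumerating paths rather than single nodes is what drives the data expansion: the number of multi-hop concept combinations grows much faster than the number of seed knowledge points.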

๐Ÿ“ Abstract
Synthesizing high-quality reasoning data for continual training has been proven to be effective in enhancing the performance of Large Language Models (LLMs). However, previous synthetic approaches struggle to easily scale up data and incur high costs in the pursuit of high quality. In this paper, we propose the Graph-based Synthetic Data Pipeline (GSDP), an economical and scalable framework for high-quality reasoning data synthesis. Inspired by knowledge graphs, we extract knowledge points from seed data and construct a knowledge-point relationship graph to explore their interconnections. By exploring the implicit relationships among knowledge, our method achieves 255× data expansion. Furthermore, GSDP, led by open-source models, achieves synthesis quality comparable to GPT-4-0613 while maintaining 100× lower costs. To tackle the most challenging mathematical reasoning task, we present the GSDP-MATH dataset comprising over 1.91 million pairs of math problems and answers. After fine-tuning on GSDP-MATH, GSDP-7B, based on Mistral-7B, achieves 37.7% accuracy on MATH and 78.4% on GSM8K, demonstrating the effectiveness of our method. The dataset and models will be released at https://github.com/Jayce1kk/GSDP.
Problem

Research questions and friction points this paper is trying to address.

Scaling high-quality reasoning data synthesis economically
Exploring knowledge interconnections via graph-based relationships
Enhancing LLM performance with cost-effective synthetic datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Graph-based pipeline for scalable data synthesis
Knowledge-point graph built from extracted concepts and their relations
Open-source models reduce costs significantly
๐Ÿ”Ž Similar Papers
No similar papers found.