GraSP: A Unified Graph-Based Framework for Scalable Generation, Quality Tagging, and Management of Synthetic Data for SFT and DPO

📅 2025-08-21
📈 Citations: 0
Influential: 0
📄 PDF

career value

187K/year
🤖 AI Summary
High-cost and inflexible generation of high-quality synthetic data hinders efficient large language model (LLM) training. Method: This paper proposes a graph-structured, scalable synthetic data framework that unifies the dialogue modeling requirements for both supervised fine-tuning (SFT) and direct preference optimization (DPO), enabling modular configuration and complex multi-turn interaction modeling. It introduces a novel two-stage quality annotation mechanism—integrating heuristic rules with LLM-based evaluation—to automate filtering and structured organization. The framework further incorporates OASST format parsing, a rule engine, and LLM evaluation to realize an end-to-end pipeline for synthetic data generation, annotation, and management. Contribution/Results: Experiments demonstrate substantial reduction in data preparation overhead, support for large-scale and highly configurable data production, improved training integration efficiency, and enhanced consistency in data quality.

Technology Category

Application Category

📝 Abstract
The advancement of large language models (LLMs) is critically dependent on the availability of high-quality datasets for Supervised Fine-Tuning (SFT), alignment tasks like Direct Preference Optimization (DPO), etc. In this work, we present a comprehensive synthetic data generation framework that facilitates scalable, configurable, and high-fidelity generation of synthetic data tailored for these training paradigms. Our approach employs a modular and configuration-based pipeline capable of modeling complex dialogue flows with minimal manual intervention. This framework uses a dual-stage quality tagging mechanism, combining heuristic rules and LLM-based evaluations, to automatically filter and score data extracted from OASST-formatted conversations, ensuring the curation of high-quality dialogue samples. The resulting datasets are structured under a flexible schema supporting both SFT and DPO use cases, enabling seamless integration into diverse training workflows. Together, these innovations offer a robust solution for generating and managing synthetic conversational data at scale, significantly reducing the overhead of data preparation in LLM training pipelines.
Problem

Research questions and friction points this paper is trying to address.

Generating scalable synthetic data for SFT and DPO training
Automatically filtering and scoring conversation data quality
Managing synthetic dialogue datasets with minimal manual intervention
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modular pipeline for scalable synthetic data generation
Dual-stage quality tagging with heuristic and LLM evaluation
Flexible schema supporting both SFT and DPO integration
🔎 Similar Papers
No similar papers found.