GraSP: A Unified Graph-Based Framework for Scalable Generation, Quality Tagging, and Management of Synthetic Data for SFT and DPO

📅 2025-08-21

📈 Citations: 0

✨ Influential: 0

career value

176K/year

🤖 AI Summary

High-cost and inflexible generation of high-quality synthetic data hinders efficient large language model (LLM) training. Method: This paper proposes a graph-structured, scalable synthetic data framework that unifies the dialogue modeling requirements for both supervised fine-tuning (SFT) and direct preference optimization (DPO), enabling modular configuration and complex multi-turn interaction modeling. It introduces a novel two-stage quality annotation mechanism—integrating heuristic rules with LLM-based evaluation—to automate filtering and structured organization. The framework further incorporates OASST format parsing, a rule engine, and LLM evaluation to realize an end-to-end pipeline for synthetic data generation, annotation, and management. Contribution/Results: Experiments demonstrate substantial reduction in data preparation overhead, support for large-scale and highly configurable data production, improved training integration efficiency, and enhanced consistency in data quality.

Technology Category

Application Category

📝 Abstract

The advancement of large language models (LLMs) is critically dependent on the availability of high-quality datasets for Supervised Fine-Tuning (SFT), alignment tasks like Direct Preference Optimization (DPO), etc. In this work, we present a comprehensive synthetic data generation framework that facilitates scalable, configurable, and high-fidelity generation of synthetic data tailored for these training paradigms. Our approach employs a modular and configuration-based pipeline capable of modeling complex dialogue flows with minimal manual intervention. This framework uses a dual-stage quality tagging mechanism, combining heuristic rules and LLM-based evaluations, to automatically filter and score data extracted from OASST-formatted conversations, ensuring the curation of high-quality dialogue samples. The resulting datasets are structured under a flexible schema supporting both SFT and DPO use cases, enabling seamless integration into diverse training workflows. Together, these innovations offer a robust solution for generating and managing synthetic conversational data at scale, significantly reducing the overhead of data preparation in LLM training pipelines.

Problem

Research questions and friction points this paper is trying to address.

Generating scalable synthetic data for SFT and DPO training

Automatically filtering and scoring conversation data quality

Managing synthetic dialogue datasets with minimal manual intervention

Innovation

Methods, ideas, or system contributions that make the work stand out.

Modular pipeline for scalable synthetic data generation

Dual-stage quality tagging with heuristic and LLM evaluation

Flexible schema supporting both SFT and DPO integration

🔎 Similar Papers

No similar papers found.