🤖 AI Summary
Problem: Generating high-quality synthetic data is costly and inflexible, which hinders efficient large language model (LLM) training. Method: This paper proposes a graph-structured, scalable synthetic data framework that unifies the dialogue modeling requirements of supervised fine-tuning (SFT) and direct preference optimization (DPO), enabling modular configuration and modeling of complex multi-turn interactions. It introduces a two-stage quality annotation mechanism, combining heuristic rules with LLM-based evaluation, to automate filtering and structured organization of the data. The framework combines OASST format parsing, a rule engine, and LLM evaluation into an end-to-end pipeline for synthetic data generation, annotation, and management. Contribution/Results: The authors report a substantial reduction in data preparation overhead, support for large-scale and highly configurable data production, easier integration into training workflows, and more consistent data quality.
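The two-stage annotation idea can be illustrated with a minimal Python sketch. This is not the paper's implementation: the `Sample` class, the specific rules, the thresholds, and `stub_judge` (a deterministic stand-in for a real LLM evaluator) are all assumptions made for illustration. The point is only the ordering: cheap heuristic rules run first, and the more expensive LLM-based score is computed only for samples that pass them.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Sample:
    """Hypothetical container for one prompt/response pair plus its quality tags."""
    prompt: str
    response: str
    tags: Dict[str, float] = field(default_factory=dict)


def heuristic_stage(sample: Sample, min_words: int = 3) -> bool:
    """Stage 1 (assumed rules): minimum length and a crude repetition check."""
    words = sample.response.split()
    long_enough = len(words) >= min_words
    not_repetitive = len(set(words)) / max(len(words), 1) > 0.3
    passed = long_enough and not_repetitive
    sample.tags["heuristic_pass"] = float(passed)
    return passed


def llm_stage(sample: Sample, judge: Callable[[str, str], float],
              threshold: float = 0.5) -> bool:
    """Stage 2: score the sample with a judge callable and apply a threshold."""
    score = judge(sample.prompt, sample.response)
    sample.tags["llm_score"] = score
    return score >= threshold


def tag_and_filter(samples: List[Sample],
                   judge: Callable[[str, str], float]) -> List[Sample]:
    """Keep samples that pass both stages; the judge runs only after stage 1 passes."""
    return [s for s in samples if heuristic_stage(s) and llm_stage(s, judge)]


def stub_judge(prompt: str, response: str) -> float:
    """Stand-in for an LLM evaluator: 1.0 if prompt and response share a word."""
    def norm(text: str) -> set:
        return {w.strip(".,?!").lower() for w in text.split()}
    return 1.0 if norm(prompt) & norm(response) else 0.0
```

In practice `stub_judge` would be replaced by a call to an evaluator model; passing the judge in as a callable keeps the filtering logic independent of any particular LLM client.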
📝 Abstract
The advancement of large language models (LLMs) depends critically on the availability of high-quality datasets for Supervised Fine-Tuning (SFT) and for alignment methods such as Direct Preference Optimization (DPO). In this work, we present a comprehensive synthetic data generation framework that supports scalable, configurable, and high-fidelity generation of synthetic data tailored to these training paradigms. Our approach employs a modular, configuration-based pipeline capable of modeling complex dialogue flows with minimal manual intervention. The framework uses a dual-stage quality tagging mechanism, combining heuristic rules and LLM-based evaluations, to automatically filter and score data extracted from OASST-formatted conversations, so that only high-quality dialogue samples are curated. The resulting datasets are structured under a flexible schema that supports both SFT and DPO use cases, enabling seamless integration into diverse training workflows. Together, these components offer a robust solution for generating and managing synthetic conversational data at scale, significantly reducing the overhead of data preparation in LLM training pipelines.
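As a rough illustration of how one conversation tree can feed both training paradigms, the following Python sketch derives SFT and DPO records from a simplified, flat OASST-style message list. It is a sketch under stated assumptions, not the framework's actual schema: the field names (`message_id`, `parent_id`, `role`, `text`, `rank`) follow the general shape of OASST messages, the convention that a lower `rank` means a better reply is assumed, and the output record layout (`context`, `response`, `chosen`, `rejected`) and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Message = Dict[str, Optional[object]]


def history(msg: Message, by_id: Dict[str, Message]) -> List[Dict[str, str]]:
    """Walk parent links up to the root and return the turns in dialogue order."""
    turns = []
    node: Optional[Message] = msg
    while node is not None:
        turns.append({"role": node["role"], "text": node["text"]})
        node = by_id.get(node["parent_id"])
    return list(reversed(turns))


def to_records(messages: List[Message]) -> Tuple[List[dict], List[dict]]:
    """Derive SFT and DPO records from one flat, OASST-style message list."""
    by_id = {m["message_id"]: m for m in messages}
    children = defaultdict(list)
    for m in messages:
        children[m["parent_id"]].append(m)

    sft, dpo = [], []
    for m in messages:
        if m["role"] != "prompter":
            continue
        # Assumed convention: lower rank is better; unranked replies sort last.
        replies = sorted(
            children[m["message_id"]],
            key=lambda r: r["rank"] if r["rank"] is not None else float("inf"),
        )
        if not replies:
            continue
        context = history(m, by_id)
        # SFT: the full multi-turn context paired with the best-ranked reply.
        sft.append({"context": context, "response": replies[0]["text"]})
        # DPO: needs at least two competing replies to form a preference pair.
        if len(replies) >= 2:
            dpo.append({
                "context": context,
                "chosen": replies[0]["text"],
                "rejected": replies[-1]["text"],
            })
    return sft, dpo
```

Every prompter turn with at least one reply yields an SFT record, while a DPO record is produced only where the tree branches into two or more ranked replies, which is why a single schema over the same tree can serve both use cases.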