DiffLM: Controllable Synthetic Data Generation via Diffusion Language Models

📅 2024-11-05
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Large language models (LLMs) suffer from weak modeling of target distributions and cumbersome prompt engineering when synthesizing structured data (e.g., tables, code, tool calls). Method: This paper proposes DiffLM, a controllable synthesis framework based on a variational autoencoder (VAE). It introduces diffusion modeling into the language latent space for high-fidelity modeling of structured data distributions, and designs a plug-and-play latent feature injection module that decouples distribution learning from the LLM's generative objectives. Contribution/Results: The framework preserves both syntactic structure and semantic distributions without complex prompting. Evaluated on seven real-world structured datasets, its synthetic data improves downstream task performance over real data by 2–7 percent in certain cases, demonstrating gains in synthetic data quality, controllability, and generalization.

📝 Abstract
Recent advancements in large language models (LLMs) have significantly enhanced their knowledge and generative capabilities, leading to a surge of interest in leveraging LLMs for high-quality data synthesis. However, synthetic data generation via prompting LLMs remains challenging due to LLMs' limited understanding of target data distributions and the complexity of prompt engineering, especially for structured formatted data. To address these issues, we introduce DiffLM, a controllable data synthesis framework based on variational autoencoder (VAE), which further (1) leverages diffusion models to preserve more information of original distribution and format structure in the learned latent distribution and (2) decouples the learning of target distribution knowledge from the LLM's generative objectives via a plug-and-play latent feature injection module. As we observed significant discrepancies between the VAE's latent representations and the real data distribution, the latent diffusion module is introduced into our framework to learn a fully expressive latent distribution. Evaluations on seven real-world datasets with structured formatted data (i.e., Tabular, Code and Tool data) demonstrate that DiffLM generates high-quality data, with performance on downstream tasks surpassing that of real data by 2-7 percent in certain cases. The data and code will be publicly available upon completion of internal review.
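The abstract describes running a diffusion process inside the VAE's latent space rather than on tokens. As a minimal, illustrative sketch of that idea, the following implements a standard DDPM-style forward (noising) and reverse (denoising) step over a toy latent vector. The linear "encoder", the schedule values, and all shapes are assumptions for illustration, not DiffLM's actual components.

```python
import numpy as np

# Toy sketch of latent diffusion: a DDPM forward/reverse process applied
# to a VAE-style latent vector. The encoder here is a stand-in linear map;
# in DiffLM the latent would come from a trained VAE over structured text.

rng = np.random.default_rng(0)

T = 50                                 # number of diffusion steps (illustrative)
betas = np.linspace(1e-4, 0.02, T)     # linear noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)        # cumulative products, used in closed-form noising

def encode(x, W):
    """Stand-in for the VAE encoder: map data to a latent vector."""
    return x @ W

def q_sample(z0, t, noise):
    """Forward process: noise latent z0 directly to step t (closed form)."""
    return np.sqrt(alpha_bars[t]) * z0 + np.sqrt(1.0 - alpha_bars[t]) * noise

def p_sample_step(zt, t, eps_hat):
    """One reverse (denoising) step, given a predicted noise eps_hat."""
    coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
    mean = (zt - coef * eps_hat) / np.sqrt(alphas[t])
    if t > 0:  # no noise is added at the final reverse step
        mean = mean + np.sqrt(betas[t]) * rng.standard_normal(zt.shape)
    return mean

# Demo: encode a toy data point, noise its latent, take one oracle reverse step.
x = rng.standard_normal(8)
W = rng.standard_normal((8, 4))
z0 = encode(x, W)
noise = rng.standard_normal(z0.shape)
zT = q_sample(z0, T - 1, noise)        # heavily noised latent
```

In the full framework, a learned network would supply `eps_hat`; here, passing the oracle noise for step 0 exactly inverts one forward step, which is a useful sanity check on the update formulas.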
Problem

Research questions and friction points this paper is trying to address.

LLMs' limited understanding of target data distributions when prompted to synthesize data
Cumbersome prompt engineering, especially for structured formatted data
Entanglement of target-distribution learning with the LLM's generative objectives
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces a latent diffusion module to learn a fully expressive latent distribution
Decouples distribution learning from generation via a plug-and-play latent feature injection module
Closes the observed gap between the VAE's latent representations and the real data distribution
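The card does not specify how the "plug-and-play" injection works mechanically. One common realization, sketched here purely as an assumption, is to project the latent vector into a few soft-prefix embeddings prepended to the LLM's token embeddings, so that only the small projection is trained while the LLM stays frozen. All names, shapes, and weights below are hypothetical.

```python
import numpy as np

# Hedged sketch of latent feature injection via soft prefixes (an assumed
# mechanism, not confirmed by the abstract). A latent z is projected into
# N_PREFIX pseudo-token embeddings and prepended to the prompt embeddings.

rng = np.random.default_rng(1)

D_LATENT, D_MODEL, N_PREFIX = 16, 32, 4

# Stand-in for a learned injection projection; the LLM itself is untouched,
# which is what would make the module plug-and-play.
W_inj = rng.standard_normal((D_LATENT, N_PREFIX * D_MODEL)) * 0.02

def inject_latent(z, token_embs):
    """Project latent z to soft-prefix embeddings and prepend them."""
    prefix = (z @ W_inj).reshape(N_PREFIX, D_MODEL)
    return np.concatenate([prefix, token_embs], axis=0)

z = rng.standard_normal(D_LATENT)            # e.g., a latent sampled by the diffusion model
tokens = rng.standard_normal((10, D_MODEL))  # embeddings of a 10-token prompt
seq = inject_latent(z, tokens)               # (N_PREFIX + 10, D_MODEL) input to the LLM
```

Under this reading, the diffusion model controls *what* distribution the latent encodes, while the frozen LLM only handles surface generation, matching the paper's claim of decoupling the two.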