SoftSRV: Learn to Generate Targeted Synthetic Data

Date: 2024-10-21
Citations: 0
Influential: 0

AI Summary
Problem: Synthetic data generation commonly relies on handcrafted prompt templates, which are costly to adapt to new domains and generalize poorly. Method: We propose SoftSRV, a domain-agnostic, end-to-end framework that keeps a large language model (LLM) frozen and minimizes a differentiable target-distribution alignment loss with gradient-based optimization, generating high-quality, task-specific synthetic fine-tuning data without manual prompt engineering. MAUVE is used as the distributional similarity metric to measure how closely the synthetic data matches the real target distribution. Contribution/Results: SoftSRV is the first domain-independent, end-to-end method for targeted synthetic data generation. Evaluated on programming, mathematical reasoning, and general reasoning tasks, the generated data significantly boosts downstream small-model performance. Moreover, SoftSRV achieves higher MAUVE scores than state-of-the-art prompt-engineering baselines, demonstrating its strong generalizability, effectiveness, and reusability.
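
To make the idea of a distributional similarity score concrete, here is a minimal MAUVE-style sketch. It assumes the two text collections have already been reduced to discrete histograms `p` and `q` over the same bins (actual MAUVE first embeds the text with a language model and quantizes the embeddings, which is omitted here). The function names, the number of mixture weights, and the scaling constant are illustrative choices, not values taken from the paper.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions; bins with p[i] == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mauve_style_score(p, q, num_weights=25, scale=5.0):
    """Area under a divergence curve built from mixtures r = w*p + (1-w)*q.

    Simplified illustration only: p and q are assumed to be normalized
    histograms over the same bins.
    """
    # Start at the extreme point (0, 1), then walk w from near 1 down to near 0
    # so that x = exp(-scale * KL(q || r)) is non-decreasing along the path.
    curve = [(0.0, 1.0)]
    for i in range(num_weights, 0, -1):
        w = i / (num_weights + 1)
        r = [w * pi + (1 - w) * qi for pi, qi in zip(p, q)]
        curve.append((math.exp(-scale * kl(q, r)), math.exp(-scale * kl(p, r))))
    curve.append((1.0, 0.0))  # other extreme point
    # Trapezoidal rule along the curve.
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(curve, curve[1:]))

identical = mauve_style_score([0.5, 0.5], [0.5, 0.5])
partial = mauve_style_score([0.5, 0.5], [0.9, 0.1])
disjoint = mauve_style_score([1.0, 0.0], [0.0, 1.0])
```

Under these assumptions, a higher score means the two histograms are closer: identical histograms score near 1 and non-overlapping ones score near 0.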

Abstract
We present a novel framework, SoftSRV, that is used to generate targeted synthetic fine-tuning data for improving task-specific model performance. Given a sample from a target distribution, our proposed framework uses a data-driven loss minimization approach to steer a frozen large language model (LLM) to generate synthetic sequences that are similar to those from the target distribution. SoftSRV provides a practical improvement over common prompt engineering approaches that rely on human-engineered prompt-templates, which can be idiosyncratic, labor-intensive to craft, and may need to be specialized per domain. We empirically evaluate our method against standard baselines guiding a large LLM to generate synthetic data to fine-tune a smaller language model on three different domains (coding, math, reasoning). We perform these evaluations without any particular specialization of the framework to each domain, emphasizing the generality of our approach. We find that SoftSRV improves upon typical prompt engineering approaches, generating targeted data that leads to fine-tuned models with significantly better task-specific performance. In addition, SoftSRV-generated data better matches the target distribution according to the MAUVE similarity metric.
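
The abstract's core idea, steering a frozen model by minimizing a loss computed on a target sample, can be illustrated with a toy sketch. This is not the SoftSRV architecture: the "frozen LLM" here is a hypothetical fixed matrix over a four-token vocabulary, the trainable part is assumed to be a continuous input vector, and the loss is assumed to be the negative log-likelihood of the target sample. All names and sizes are made up for illustration.

```python
import math
import random

VOCAB, DIM = 4, 3  # toy vocabulary and embedding sizes (hypothetical)

# Stand-in for the frozen LLM: a fixed matrix mapping an input vector to
# token logits. It is never updated below.
_rng = random.Random(0)
FROZEN_W = [[_rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(VOCAB)]

def token_probs(prompt):
    """Softmax over the frozen model's logits for a given input vector."""
    logits = [sum(w * x for w, x in zip(row, prompt)) for row in FROZEN_W]
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nll(prompt, target_tokens):
    """Average negative log-likelihood of the target sample."""
    probs = token_probs(prompt)
    return -sum(math.log(probs[t]) for t in target_tokens) / len(target_tokens)

def fit_prompt(target_tokens, steps=500, lr=0.1):
    """Gradient descent on the input vector only; FROZEN_W stays fixed."""
    freq = [target_tokens.count(v) / len(target_tokens) for v in range(VOCAB)]
    prompt = [0.0] * DIM
    for _ in range(steps):
        probs = token_probs(prompt)
        # dNLL/dlogit_v = probs[v] - freq[v]; chain rule through FROZEN_W.
        grad = [sum((probs[v] - freq[v]) * FROZEN_W[v][d] for v in range(VOCAB))
                for d in range(DIM)]
        prompt = [p - lr * g for p, g in zip(prompt, grad)]
    return prompt

def sample_synthetic(prompt, n, seed=0):
    """Draw n synthetic tokens from the steered frozen model."""
    rng = random.Random(seed)
    return rng.choices(range(VOCAB), weights=token_probs(prompt), k=n)

target_sample = [2, 2, 2, 1, 2, 2, 3, 2]  # toy sample from a "target distribution"
learned = fit_prompt(target_sample)
synthetic = sample_synthetic(learned, n=20)
```

Only the input vector changes during fitting, so the same frozen model could be steered toward a different domain by refitting on a different target sample, which is the property the abstract emphasizes when it says the framework is not specialized per domain.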
Problem

Research questions and friction points this paper is trying to address.

Generate targeted synthetic data
Improve task-specific model performance
Reduce reliance on hand-engineered, domain-specific prompt templates
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates targeted synthetic data
Uses data-driven loss minimization to steer a frozen LLM
Improves task-specific model performance