Towards Active Synthetic Data Generation for Finetuning Language Models

📅 2025-11-30
🤖 AI Summary
This work addresses the limited adaptability of static synthetic data generation in language model fine-tuning. We propose a dynamic closed-loop synthetic data generation paradigm: during training, samples generated by a teacher model are actively selected based on the student model’s current state—such as prediction uncertainty and hidden-layer activations—enabling iterative optimization via “generate–evaluate–select–fine-tune”. Our key contribution is a lightweight, interpretable active selection strategy that significantly outperforms complex sampling methods. Evaluated on four mathematical and logical reasoning benchmarks, our approach consistently improves the performance of four small language models under fixed computational budgets, yielding average accuracy gains of 3.2–5.7 percentage points. These results demonstrate the method’s effectiveness, generalizability across diverse models and tasks, and computational efficiency.
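The "generate–evaluate–select–fine-tune" loop described above can be sketched as follows. This is a toy illustration, not the paper's implementation: `teacher_generate` and `student_uncertainty` are hypothetical stand-ins for a real teacher model and a real student uncertainty estimate (e.g., predictive entropy), and the fine-tuning step is left as a comment.

```python
import math
import random

def entropy(probs):
    """Shannon entropy of a predictive distribution (higher = more uncertain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def teacher_generate(n):
    # Hypothetical teacher: returns n candidate synthetic samples.
    return [f"sample-{i}-{random.random():.4f}" for i in range(n)]

def student_uncertainty(sample):
    # Hypothetical student scoring: entropy of a (random) distribution
    # over 4 outcomes stands in for real prediction uncertainty.
    probs = [random.random() for _ in range(4)]
    z = sum(probs)
    return entropy([p / z for p in probs])

def active_finetune_loop(rounds=3, pool_size=20, select_k=5):
    """Closed-loop curation: generate, evaluate, select, (fine-tune)."""
    selected = []
    for _ in range(rounds):
        pool = teacher_generate(pool_size)                    # generate
        scored = [(student_uncertainty(s), s) for s in pool]  # evaluate
        scored.sort(reverse=True)                             # rank by uncertainty
        batch = [s for _, s in scored[:select_k]]             # select top-k
        selected.extend(batch)
        # fine-tune: a student gradient update on `batch` would happen here,
        # changing the uncertainty scores used in the next round.
    return selected

data = active_finetune_loop()
print(len(data))  # 15 samples selected across 3 rounds
```

The key design point the summary highlights is that selection is conditioned on the student's *current* state, so each round's top-k set can differ from what a static, one-shot generation would have produced.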

📝 Abstract
A common and effective means for improving language model capabilities involves finetuning a ``student'' language model's parameters on generations from a more proficient ``teacher'' model. Termed ``synthetic data'', these generations are often produced before any student finetuning, but some work has considered generating new synthetic samples as training progresses. This paper studies and advocates for the latter case, where data are generated in an iterative, closed-loop fashion that is guided by the current state of the student model. For a fixed budget of generated samples, or a budget in terms of compute spent querying a teacher, we show that this curation of finetuning data affords improved student performance over static generation. Further, while there have been several LLM-specific methods proposed that operate in this regime, we find that simple, inexpensive selection criteria from the active learning literature tend to be most performant. We validate these claims across four mathematical and logical reasoning datasets using four different small language models.
Problem

Research questions and friction points this paper is trying to address.

Optimizes synthetic data generation for language model finetuning
Compares iterative versus static synthetic data generation methods
Evaluates active learning criteria for selecting synthetic training samples
Innovation

Methods, ideas, or system contributions that make the work stand out.

Active synthetic data generation with closed-loop feedback
Using simple active learning criteria for sample selection
Iterative teacher-student finetuning improves model performance
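Two of the "simple, inexpensive" criteria from the active learning literature are predictive entropy and margin sampling. The sketch below illustrates both on invented toy probabilities (the sample names and distributions are assumptions for illustration, not results from the paper):

```python
import math

def entropy_score(probs):
    """Predictive entropy: higher means the student is less certain overall."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def margin_score(probs):
    """Negative gap between the top-2 probabilities: higher = harder to call."""
    top2 = sorted(probs, reverse=True)[:2]
    return -(top2[0] - top2[1])

# Toy student predictions over three candidate synthetic samples.
candidates = {
    "easy":      [0.90, 0.05, 0.05],
    "ambiguous": [0.40, 0.35, 0.25],
    "split":     [0.50, 0.48, 0.02],
}

most_uncertain = max(candidates, key=lambda k: entropy_score(candidates[k]))
smallest_margin = max(candidates, key=lambda k: margin_score(candidates[k]))
print(most_uncertain, smallest_margin)  # ambiguous split
```

Note the two criteria can disagree: entropy favors the broadly spread "ambiguous" distribution, while margin favors "split", where the top two options are nearly tied.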
Samuel Kessler, Microsoft (Machine Learning)
Menglin Xia, Microsoft (NLP)
Daniel Madrigal Diaz, Microsoft
Dongge Han, Microsoft (LLMs, Recommender Systems, Reinforcement Learning, Multiagent Systems, Game Theory)
Helia Heshemi, Microsoft
Saravan Rajmohan, Microsoft
Victor Ruhle, Microsoft
Jordan T. Ash, Microsoft Research NYC