Towards Active Synthetic Data Generation for Finetuning Language Models

📅 2025-11-30
🤖 AI Summary
This work addresses the limited adaptability of static synthetic data generation in language model fine-tuning. We propose a dynamic closed-loop synthetic data generation paradigm: during training, samples generated by a teacher model are actively selected based on the student model’s current state—such as prediction uncertainty and hidden-layer activations—enabling iterative optimization via “generate–evaluate–select–fine-tune”. Our key contribution is a lightweight, interpretable active selection strategy that significantly outperforms complex sampling methods. Evaluated on four mathematical and logical reasoning benchmarks, our approach consistently improves the performance of four small language models under fixed computational budgets, yielding average accuracy gains of 3.2–5.7 percentage points. These results demonstrate the method’s effectiveness, generalizability across diverse models and tasks, and computational efficiency.
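The "generate–evaluate–select–fine-tune" loop described above can be sketched as follows. This is a toy illustration, not the paper's implementation: `teacher_generate` and `student_uncertainty` are hypothetical stand-ins for a real teacher model and a real student uncertainty estimate (e.g., predictive entropy), and the fine-tuning step is left as a comment.

```python
import math
import random

def entropy(probs):
    """Shannon entropy of a predictive distribution (higher = more uncertain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def teacher_generate(n):
    # Hypothetical teacher: returns n candidate synthetic samples.
    return [f"sample-{i}-{random.random():.4f}" for i in range(n)]

def student_uncertainty(sample):
    # Hypothetical student scoring: entropy of a (random) distribution
    # over 4 outcomes stands in for real prediction uncertainty.
    probs = [random.random() for _ in range(4)]
    z = sum(probs)
    return entropy([p / z for p in probs])

def active_finetune_loop(rounds=3, pool_size=20, select_k=5):
    """Closed-loop curation: generate, evaluate, select, (fine-tune)."""
    selected = []
    for _ in range(rounds):
        pool = teacher_generate(pool_size)                    # generate
        scored = [(student_uncertainty(s), s) for s in pool]  # evaluate
        scored.sort(reverse=True)                             # rank by uncertainty
        batch = [s for _, s in scored[:select_k]]             # select top-k
        selected.extend(batch)
        # fine-tune: a student gradient update on `batch` would happen here,
        # changing the uncertainty scores used in the next round.
    return selected

data = active_finetune_loop()
print(len(data))  # 15 samples selected across 3 rounds
```

The key design point the summary highlights is that selection is conditioned on the student's *current* state, so each round's top-k set can differ from what a static, one-shot generation would have produced.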

📝 Abstract
A common and effective means for improving language model capabilities involves finetuning a ``student'' language model's parameters on generations from a more proficient ``teacher'' model. Termed ``synthetic data'', these generations are often produced before any student finetuning, but some work has considered generating new synthetic samples as training progresses. This paper studies and advocates for the latter case, where data are generated in an iterative, closed-loop fashion that is guided by the current state of the student model. For a fixed budget of generated samples, or a budget in terms of compute spent querying a teacher, we show that this curation of finetuning data affords improved student performance over static generation. Further, while there have been several LLM-specific methods proposed that operate in this regime, we find that simple, inexpensive selection criteria from the active learning literature tend to be most performant. We validate these claims across four mathematical and logical reasoning datasets using four different small language models.
Problem

Research questions and friction points this paper is trying to address.

Optimizes synthetic data generation for language model finetuning
Compares iterative versus static synthetic data generation methods
Evaluates active learning criteria for selecting synthetic training samples
Innovation

Methods, ideas, or system contributions that make the work stand out.

Active synthetic data generation with closed-loop feedback
Using simple active learning criteria for sample selection
Iterative teacher-student finetuning improves model performance
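Two of the "simple, inexpensive" criteria from the active learning literature are predictive entropy and margin sampling. The sketch below illustrates both on invented toy probabilities (the sample names and distributions are assumptions for illustration, not results from the paper):

```python
import math

def entropy_score(probs):
    """Predictive entropy: higher means the student is less certain overall."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def margin_score(probs):
    """Negative gap between the top-2 probabilities: higher = harder to call."""
    top2 = sorted(probs, reverse=True)[:2]
    return -(top2[0] - top2[1])

# Toy student predictions over three candidate synthetic samples.
candidates = {
    "easy":      [0.90, 0.05, 0.05],
    "ambiguous": [0.40, 0.35, 0.25],
    "split":     [0.50, 0.48, 0.02],
}

most_uncertain = max(candidates, key=lambda k: entropy_score(candidates[k]))
smallest_margin = max(candidates, key=lambda k: margin_score(candidates[k]))
print(most_uncertain, smallest_margin)  # ambiguous split
```

Note the two criteria can disagree: entropy favors the broadly spread "ambiguous" distribution, while margin favors "split", where the top two options are nearly tied.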
Samuel Kessler, Microsoft (Machine Learning)
Menglin Xia, Microsoft (NLP)
Daniel Madrigal Diaz, Microsoft
Dongge Han, Microsoft (LLMs, Recommender Systems, Reinforcement Learning, Multiagent Systems, Game Theory)
Helia Heshemi, Microsoft
Saravan Rajmohan, Microsoft
Victor Ruhle, Microsoft
Jordan T. Ash, Microsoft Research NYC