BARE: Combining Base and Instruction-Tuned Language Models for Better Synthetic Data Generation

📅 2025-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Synthetic data generation faces a fundamental trade-off between diversity and quality: base models exhibit high diversity but poor instruction adherence, whereas instruction-tuned models produce high-quality yet homogeneous outputs. Method: This paper proposes the two-stage Base-Refine (BARE) framework: (1) a base model generates diverse initial samples; (2) an instruction-tuned model refines them, correcting content and format. The approach requires only minimal few-shot prompting and curation. Contribution/Results: The paper provides a systematic analysis of the complementary strengths of base and instruction-tuned models in diversity versus instruction following, enabling a collaborative generation paradigm. Experiments show that fine-tuning small models on just 1,000 BARE-generated samples reaches performance comparable to the best similarly sized models on LiveCodeBench. On GSM8K, accuracy exceeds that of models trained solely on instruct-generated data by 101%; on RAFT, performance improves by 18.4% over prior SOTA methods.
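The two-stage pipeline described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the two model-call functions are hypothetical stubs standing in for real base-model and instruct-model API calls.

```python
# Minimal sketch of the two-stage Base-Refine (BARE) idea.
# base_generate and instruct_refine are hypothetical placeholders
# for calls to a base model and an instruction-tuned model.

def base_generate(few_shot_prompt: str, n: int) -> list[str]:
    """Stage 1: sample n diverse candidate examples from a base model.
    Stubbed with canned outputs for illustration."""
    return [f"{few_shot_prompt} -> draft {i}" for i in range(n)]

def instruct_refine(sample: str) -> str:
    """Stage 2: an instruct-tuned model improves the quality of one draft
    (fixing content and format) while keeping its distinct content."""
    return sample.replace("draft", "refined")

def bare(few_shot_prompt: str, n: int = 3) -> list[str]:
    drafts = base_generate(few_shot_prompt, n)    # diversity from the base model
    return [instruct_refine(d) for d in drafts]   # quality from the instruct model

print(bare("Write a GSM8K-style math problem", n=2))
```

The key design point is that diversity is fixed in stage 1 (one draft per sample, rather than repeated sampling from the instruct model), so the refinement step only needs to raise quality, not generate variety.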

📝 Abstract
As the demand for high-quality data in model training grows, researchers and developers are increasingly generating synthetic data to tune and train LLMs. A common assumption about synthetic data is that sampling from instruct-tuned models is sufficient; however, these models struggle to produce diverse outputs, a key requirement for generalization. In this work we show that, despite various prompting methods, achieving meaningful diversity from instruct-tuned models remains challenging. In contrast, we find base models without post-training exhibit greater diversity, but are less capable at instruction following and hence of lower quality. Leveraging this insight, we propose Base-Refine (BARE), a synthetic data generation method that combines the diversity of base models with the quality of instruct-tuned models through a two-stage process. With minimal few-shot examples and curation, BARE generates diverse and high-quality datasets, improving downstream task performance. We show that fine-tuning with as few as 1,000 BARE-generated samples can reach performance comparable to the best similarly sized models on LiveCodeBench tasks. Furthermore, fine-tuning with BARE-generated data achieves a 101% improvement over instruct-only data on GSM8K and an 18.4% improvement over SOTA methods on RAFT.
Problem

Research questions and friction points this paper is trying to address.

Enhancing synthetic data diversity and quality
Combining base and instruct-tuned models effectively
Improving downstream task performance with minimal data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines base and instruct-tuned models
Two-stage synthetic data generation
Enhances diversity and quality