🤖 AI Summary
Addressing the cost and labor of constructing high-quality, domain-specific annotated data, this paper proposes a few-shot-driven synthetic data generation paradigm. Given only a small set of user-provided examples, the method retrieves semantically relevant human-written documents from large-scale web-crawled corpora and uses instruction-tuned large language models (LLMs) to transform them into well-formatted, task-specific synthetic training samples. By combining corpus retrieval with LLM-based augmentation, the approach requires no human annotation, adapts across domains, and generalizes efficiently from few examples. Empirical evaluation on biology, medicine, and commonsense question answering (QA), as well as summarization, shows that summarization models trained on the generated data outperform models trained on human-curated data by 46 preference points, while the QA models match or surpass general-purpose foundation models.
📝 Abstract
Building high-quality datasets for specialized tasks is a time-consuming and resource-intensive process that often requires specialized domain knowledge. We propose Corpus Retrieval and Augmentation for Fine-Tuning (CRAFT), a method for generating synthetic datasets given a small number of user-written few-shot examples that demonstrate the task to be performed. Based on these few-shot examples, we use large-scale public web-crawled corpora and similarity-based document retrieval to find other relevant human-written documents. Instruction-tuned large language models (LLMs) then augment the retrieved documents into custom-formatted task samples, which can be used for fine-tuning. We demonstrate that CRAFT can efficiently generate large-scale task-specific training datasets for four diverse tasks: biology question answering (QA), medicine QA, commonsense QA, and summarization. Our experiments show that CRAFT-based models outperform or match general LLMs on the QA tasks, while CRAFT-based summarization models outperform models trained on human-curated data by 46 preference points.
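The retrieval-then-augmentation pipeline described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the bag-of-words cosine similarity stands in for whatever learned sentence embeddings CRAFT uses, and the `llm` callable and its prompt template are placeholders for an instruction-tuned LLM call.

```python
import re
from collections import Counter
from math import sqrt

def embed(text):
    # Toy bag-of-words vector; CRAFT itself would use learned embeddings.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(few_shots, corpus, k=2):
    # Rank corpus documents by similarity to the pooled few-shot examples.
    query = embed(" ".join(few_shots))
    ranked = sorted(corpus, key=lambda d: cosine(query, embed(d)), reverse=True)
    return ranked[:k]

def augment(document, llm):
    # `llm` is a stand-in for an instruction-tuned LLM; the prompt wording
    # is illustrative, not the paper's actual template.
    prompt = f"Rewrite the following document as a QA training sample:\n{document}"
    return llm(prompt)

few_shots = ["Which enzyme unwinds DNA during replication? Helicase."]
corpus = [
    "Helicase is an enzyme that separates DNA strands during replication.",
    "The stock market closed higher on Tuesday.",
]
docs = retrieve(few_shots, corpus, k=1)
samples = [augment(d, llm=lambda p: {"synthetic_sample_from": p}) for d in docs]
```

Here the on-topic biology document is retrieved ahead of the unrelated one, and each retrieved document is turned into one synthetic training sample; at scale, the same loop over millions of web documents yields a fine-tuning dataset.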