Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources

📅 2024-09-12
🏛️ arXiv.org
📈 Citations: 13
Influential: 1
🤖 AI Summary
Existing synthetic data often suffers from low quality and conspicuous artificial artifacts, limiting its usefulness for improving large language models' (LLMs) structured understanding, complex reasoning, and tool use. To address this, the paper proposes Source2Synth, a framework for synthetic data generation and curation grounded in real data sources: it constructs examples with intermediate reasoning chains from real-world data and applies an answerability check to filter out low-quality generations, yielding high-quality, verifiable training instances without any manual annotation. Empirical evaluation shows substantial gains over fine-tuned baselines: +25.51% on WikiSQL (tabular QA) and +22.57% on HotPotQA (multi-hop QA).

📝 Abstract
Large Language Models still struggle in challenging scenarios that leverage structured data, complex reasoning, or tool usage. In this paper, we propose Source2Synth: a new method that can be used for teaching LLMs new skills without relying on costly human annotations. Source2Synth takes as input a custom data source and produces synthetic data points with intermediate reasoning steps grounded in real-world sources. Source2Synth improves the dataset quality by discarding low-quality generations based on their answerability. We demonstrate the generality of this approach by applying it to two challenging domains: we test reasoning abilities in multi-hop question answering (MHQA), and tool usage in tabular question answering (TQA). Our method improves performance by 25.51% for TQA on WikiSQL and 22.57% for MHQA on HotPotQA compared to the fine-tuned baselines.
Problem

Research questions and friction points this paper is trying to address.

Generating high-quality synthetic data for LLMs
Improving data quality by filtering low-quality generations
Enhancing reasoning and tool usage in question answering
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates synthetic data from real sources
Filters low-quality data via answerability checks
Applies to both document and table tasks
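The pipeline described above can be sketched as a generate-then-curate loop: produce candidate examples grounded in a real source, then keep only those a model can actually answer from that source. This is a minimal illustrative sketch, not the paper's implementation; `call_llm` is a toy stand-in for a real LLM API, and all function and field names are assumptions.

```python
def call_llm(prompt: str) -> str:
    """Toy stand-in for an LLM call (assumption, not a real API):
    it can only 'answer' when the answer literally appears in the prompt."""
    return "Paris" if "Paris" in prompt else "unknown"


def generate_example(source_passage: str, question: str, answer: str) -> dict:
    """Step 1 (generation): build a synthetic example with an intermediate
    reasoning step, grounded in a real source passage."""
    return {
        "context": source_passage,
        "question": question,
        "reasoning": "Locate the supporting fact in the passage, then answer.",
        "answer": answer,
    }


def is_answerable(example: dict) -> bool:
    """Step 2 (curation): keep the example only if a model recovers the
    target answer from the question plus its grounding context."""
    prediction = call_llm(example["context"] + "\n" + example["question"])
    return prediction.strip().lower() == example["answer"].strip().lower()


# One well-grounded candidate and one whose answer is unsupported.
candidates = [
    generate_example("Paris is the capital of France.",
                     "What is the capital of France?", "Paris"),
    generate_example("Berlin lies on the river Spree.",
                     "What is the capital of France?", "Paris"),
]

# Answerability-based filtering discards the ungrounded generation.
curated = [ex for ex in candidates if is_answerable(ex)]
print(len(curated))  # only the grounded example survives
```

In the paper this check uses an LLM fine-tuned on an initial data slice; the sketch replaces it with a string-matching stub so the filtering logic is visible and runnable.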