Text-to-SQL Domain Adaptation via Human-LLM Collaborative Data Annotation

📅 2025-02-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the sharp performance degradation of Text-to-SQL models on unseen database schemas due to scarce domain-specific annotated data, this paper proposes SQLsynth—a human-in-the-loop annotation framework. SQLsynth introduces a novel structured collaboration paradigm integrating large language model–based SQL generation, expert verification, and interactive feedback, augmented with SQL syntactic constraints, natural language diversity enhancement, and cognitive load optimization modules. Through the first user study on such a framework, we empirically demonstrate its comprehensive advantages: annotation throughput increases by 2.3× and error rate decreases by 41%, while preserving semantic diversity. When used to construct training data for downstream fine-tuning, SQLsynth-annotated datasets improve cross-domain Text-to-SQL accuracy by an average of 19.6%, significantly alleviating the high cost of acquiring high-quality labeled data in new domains.

Technology Category

Application Category

📝 Abstract
Text-to-SQL models, which parse natural language (NL) questions to executable SQL queries, are increasingly adopted in real-world applications. However, deploying such models in the real world often requires adapting them to the highly specialized database schemas used in specific applications. We find that existing text-to-SQL models experience significant performance drops when applied to new schemas, primarily due to the lack of domain-specific data for fine-tuning. This data scarcity also limits the ability to effectively evaluate model performance in new domains. Continuously obtaining high-quality text-to-SQL data for evolving schemas is prohibitively expensive in real-world scenarios. To bridge this gap, we propose SQLsynth, a human-in-the-loop text-to-SQL data annotation system. SQLsynth streamlines the creation of high-quality text-to-SQL datasets through human-LLM collaboration in a structured workflow. A within-subjects user study comparing SQLsynth with manual annotation and ChatGPT shows that SQLsynth significantly accelerates text-to-SQL data annotation, reduces cognitive load, and produces datasets that are more accurate, natural, and diverse. Our code is available at https://github.com/adobe/nl_sql_analyzer.
Problem

Research questions and friction points this paper is trying to address.

Adapt Text-to-SQL models to specialized database schemas
Address the scarcity of domain-specific Text-to-SQL data
Enhance Text-to-SQL data annotation efficiency and accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Human-LLM collaboration
Structured workflow
Accelerates annotation
🔎 Similar Papers
No similar papers found.
Y
Yuan Tian
Purdue University, West Lafayette, Indiana, USA
D
Daniel Lee
Adobe Inc., San Jose, California, USA
F
Fei Wu
Adobe Inc., Seattle, Washington, USA
Tung Mai
Tung Mai
Adobe Research
Algorithms
K
Kun Qian
Adobe Inc., Seattle, Washington, USA
S
Siddhartha Sahai
Adobe Inc., Seattle, Washington, USA
T
Tianyi Zhang
Purdue University, West Lafayette, Indiana, USA
Yunyao Li
Yunyao Li
Director of Machine Learning, Adobe Experience Platform
Natural Language ProcessingMachine LearningHuman Computer InteractionData Management