🤖 AI Summary
To address the sharp performance degradation of Text-to-SQL models on unseen database schemas, which stems from scarce domain-specific annotated data, this paper proposes SQLsynth, a human-in-the-loop annotation framework. SQLsynth introduces a structured collaboration paradigm that integrates large language model-based SQL generation, expert verification, and interactive feedback, augmented with modules for SQL syntactic constraints, natural language diversity enhancement, and cognitive load optimization. The first user study on such a framework shows that annotation throughput increases by 2.3× and the error rate decreases by 41%, while semantic diversity is preserved. When used to construct training data for downstream fine-tuning, SQLsynth-annotated datasets improve cross-domain Text-to-SQL accuracy by an average of 19.6%, which substantially reduces the cost of acquiring high-quality labeled data in new domains.
📝 Abstract
Text-to-SQL models, which parse natural language (NL) questions into executable SQL queries, are increasingly adopted in real-world applications. However, deploying such models in practice often requires adapting them to the highly specialized database schemas used in specific applications. We find that existing text-to-SQL models experience significant performance drops when applied to new schemas, primarily due to the lack of domain-specific data for fine-tuning. This data scarcity also limits the ability to effectively evaluate model performance in new domains. Continuously obtaining high-quality text-to-SQL data for evolving schemas is prohibitively expensive in real-world scenarios. To bridge this gap, we propose SQLsynth, a human-in-the-loop text-to-SQL data annotation system. SQLsynth streamlines the creation of high-quality text-to-SQL datasets through human-LLM collaboration in a structured workflow. A within-subjects user study comparing SQLsynth with manual annotation and ChatGPT shows that SQLsynth significantly accelerates text-to-SQL data annotation, reduces cognitive load, and produces datasets that are more accurate, natural, and diverse. Our code is available at https://github.com/adobe/nl_sql_analyzer.