SING-SQL: A Synthetic Data Generation Framework for In-Domain Text-to-SQL Translation

📅 2025-09-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Enterprises need domain-specific Text-to-SQL models trained and evaluated on their own proprietary databases, yet they often lack annotated SQL logs or human-labeled data. Method: This paper introduces a fully automated two-stage synthetic data generation framework: (1) hierarchical sub-schema partitioning to reduce schema complexity; (2) multi-difficulty SQL generation followed by LLM-as-a-judge quality evaluation, executability verification, automatic error correction, and column-balanced sampling. Together these produce high-quality in-domain training data without any SQL logs or manual annotation. The authors further investigate schema-free fine-tuning and schema-only contextual inference. Results: On a subset of the BIRD benchmark, SingSQL-LM-3B-R64 reaches 82.87% Soft F1 and a 73.03% Execution Accuracy (EX) upper bound with 32 candidates, outperforming the best 3B-scale baseline by +16.21 points in Soft F1 and +12.36 in EX; the 1.5B variant improves over prior systems by +9.30 in Soft F1 and +4.49 in EX.
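The abstract reports the 32-candidate figure as an EX upper bound. One common reading, assumed here and not confirmed by the source, is that a question counts as solved when at least one of its sampled candidates executes to the same result as the reference query. A minimal sketch under that assumption follows; the function name `ex_upper_bound` and the toy table are hypothetical.

```python
import sqlite3

def ex_upper_bound(conn, items):
    """items: list of (gold_sql, [candidate_sql, ...]).
    Returns the fraction of items where any candidate matches the gold result.
    Assumption: results are compared as unordered sets of rows."""
    def run(sql):
        try:
            return set(conn.execute(sql).fetchall())
        except sqlite3.Error:
            return None  # candidate (or gold) query failed to execute

    solved = 0
    for gold_sql, candidates in items:
        gold = run(gold_sql)
        if gold is not None and any(run(c) == gold for c in candidates):
            solved += 1
    return solved / len(items) if items else 0.0

# Toy database and candidate sets, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (id INTEGER, v INTEGER);
INSERT INTO t VALUES (1, 10), (2, 20);
""")
items = [
    ("SELECT v FROM t WHERE id = 1",
     ["SELECT v FROM t", "SELECT v FROM t WHERE id = 1"]),
    ("SELECT COUNT(*) FROM t",
     ["SELECT MAX(v) FROM t", "SELEC nonsense"]),
]
score = ex_upper_bound(conn, items)
```

Under this reading the score can only grow as more candidates are sampled, which is why it is an upper bound on what a candidate-selection step could achieve rather than a single-shot accuracy.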

📝 Abstract

Translating natural language questions into SQL has become a core challenge in enabling non-technical users to query databases. While recent work has explored large-scale synthetic data generation to improve model performance through post-training, most efforts emphasize cross-domain generalization. This leaves a gap for real-world enterprise scenarios, where models need to specialize to a single database schema and organizations need to be able to evaluate their Text-to-SQL systems on their own databases. To address this, we introduce SING-SQL, a fully automated two-stage framework for generating high-quality, high-coverage synthetic Text-to-SQL data for any target database, without relying on SQL logs or manual annotations. Our approach hierarchically partitions a database schema into sub-schemas, synthesizes SQL queries across multiple complexity levels, and applies a quality-aware pipeline that includes LLM-as-a-judge validation, executability checks, automatic repair, and column balancing. We further release SingSQL-LM, a family of compact language models fine-tuned on the synthetic data, achieving strong in-domain generalization. On a subset of the BIRD benchmark, SingSQL-LM-3B-R64 reaches 82.87% Soft F1 and a 73.03% EX upper bound with 32 candidates, outperforming the best 3B-scale baseline by +16.21 in Soft F1 and +12.36 in EX. At the 1.5B scale, SingSQL-LM-1.5B-R64 improves over prior systems by +9.30 in Soft F1 and +4.49 in EX. On synthetic evaluation sets, SingSQL-LMs exceed prior systems by wide margins, establishing state-of-the-art performance among open models at comparable scales. Our study of context management strategies reveals that schema-free fine-tuning combined with schema-only inference provides the most robust results. These findings establish SING-SQL as a scalable, database-agnostic paradigm for producing and evaluating enterprise-grade Text-to-SQL systems.
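Two of the quality steps named in the abstract, executability checks and column balancing, can be illustrated with a small sketch. This is not the authors' implementation: the helper names, the token-based column detection, and the greedy selection rule are all assumptions made for illustration, and SQLite stands in for the target database.

```python
import re
import sqlite3
from collections import Counter

def is_executable(conn, sql):
    """Keep a query only if the database can actually run it."""
    try:
        conn.execute(sql).fetchall()
        return True
    except sqlite3.Error:
        return False

def columns_used(sql, all_columns):
    """Crude assumption: a column is 'used' if its name appears as a token."""
    tokens = set(re.findall(r"[A-Za-z_]\w*", sql.lower()))
    return {c for c in all_columns if c.lower() in tokens}

def balanced_sample(queries, all_columns, k):
    """Greedy heuristic: repeatedly pick the query whose columns
    have been covered least often by the queries chosen so far."""
    counts = Counter()
    pool = list(queries)
    chosen = []

    def cost(q):
        cols = columns_used(q, all_columns)
        if not cols:
            return float("inf")  # queries touching no known column go last
        return sum(counts[c] for c in cols) / len(cols)

    while pool and len(chosen) < k:
        best = min(pool, key=cost)
        pool.remove(best)
        chosen.append(best)
        counts.update(columns_used(best, all_columns))
    return chosen

# Toy schema and candidate queries, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (name TEXT, dept TEXT, salary INTEGER)")
q1 = "SELECT name FROM emp"
q2 = "SELECT name FROM emp WHERE name LIKE 'A%'"
q3 = "SELECT dept FROM emp"
q4 = "SELECT salary FROM emp WHERE dept = 'x'"
q5 = "SELECT nope FROM emp"  # references a column that does not exist

valid = [q for q in (q1, q2, q3, q4, q5) if is_executable(conn, q)]
chosen = balanced_sample(valid, ["name", "dept", "salary"], 3)
```

Here `q5` is dropped by the executability check, and the sampler prefers `q3` and `q4` over `q2` because `name` is already covered by `q1`. A production pipeline would more plausibly extract columns with a SQL parser than with token matching.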

Problem

Research questions and friction points this paper is trying to address.

Generating synthetic Text-to-SQL data for specific database schemas

Automating high-quality SQL query synthesis without manual annotations

Improving in-domain generalization for enterprise database systems

Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated synthetic data generation for target databases

Hierarchical schema partitioning and multi-level SQL synthesis

Quality-aware pipeline with LLM validation and repair
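The summary does not spell out how the schema is partitioned. One plausible way to form sub-schemas, assumed here purely for illustration, is to enumerate small groups of tables that are connected through foreign keys, so each group can support join queries on its own. The function names and the toy schema below are hypothetical.

```python
from itertools import combinations

def connected(tables, fk_edges):
    """True if the given tables form one connected component
    when foreign keys are treated as undirected edges."""
    tables = set(tables)
    start = next(iter(tables))
    seen, stack = {start}, [start]
    while stack:
        current = stack.pop()
        for a, b in fk_edges:
            for u, v in ((a, b), (b, a)):
                if u == current and v in tables and v not in seen:
                    seen.add(v)
                    stack.append(v)
    return seen == tables

def sub_schemas(tables, fk_edges, max_size):
    """Enumerate FK-connected table groups of size 1..max_size."""
    result = []
    for size in range(1, max_size + 1):
        for combo in combinations(sorted(tables), size):
            if connected(combo, fk_edges):
                result.append(combo)
    return result

# Toy schema: items -> orders -> customers.
tables = ["customers", "orders", "items"]
fk_edges = [("orders", "customers"), ("items", "orders")]
parts = sub_schemas(tables, fk_edges, 2)
```

With `max_size=2` this yields the three single tables plus the pairs `("customers", "orders")` and `("items", "orders")`; the unconnected pair `("customers", "items")` is skipped. Exhaustive enumeration grows combinatorially, so a real system would need to cap or sample the groups on large schemas.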
