OmniSQL: Synthesizing High-quality Text-to-SQL Data at Scale

📅 2025-03-04
🤖 AI Summary
Existing text-to-SQL approaches rely heavily on proprietary large language models (LLMs), raising concerns regarding data privacy, limited customization, and poor generalization. To address these limitations, we propose a scalable, synthetic-data-driven paradigm: a database-aware, controllable LLM-based synthesis pipeline that jointly generates structured schema representations, corresponding SQL queries, natural-language questions, and chain-of-thought (CoT) reasoning traces, forming high-fidelity four-tuple samples. This yields SynSQL-2.5M, the first million-scale, high-quality synthetic dataset (2.5M samples spanning 16,000+ diverse databases). Leveraging SynSQL-2.5M, we train the open-source OmniSQL model family (7B, 14B, and 32B parameter variants). Evaluated across nine standard benchmarks, OmniSQL consistently outperforms all prior open-source methods and matches or exceeds the performance of GPT-4o and DeepSeek-V3. All code, data, and models are publicly released.

📝 Abstract
Text-to-SQL, the task of translating natural language questions into SQL queries, plays a crucial role in enabling non-experts to interact with databases. While recent advancements in large language models (LLMs) have significantly enhanced text-to-SQL performance, existing approaches face notable limitations in real-world text-to-SQL applications. Prompting-based methods often depend on closed-source LLMs, which are expensive, raise privacy concerns, and lack customization. Fine-tuning-based methods, on the other hand, suffer from poor generalizability due to the limited coverage of publicly available training data. To overcome these challenges, we propose a novel and scalable text-to-SQL data synthesis framework for automatically synthesizing large-scale, high-quality, and diverse datasets without extensive human intervention. Using this framework, we introduce SynSQL-2.5M, the first million-scale text-to-SQL dataset, containing 2.5 million samples spanning over 16,000 synthetic databases. Each sample includes a database, SQL query, natural language question, and chain-of-thought (CoT) solution. Leveraging SynSQL-2.5M, we develop OmniSQL, a powerful open-source text-to-SQL model available in three sizes: 7B, 14B, and 32B. Extensive evaluations across nine datasets demonstrate that OmniSQL achieves state-of-the-art performance, matching or surpassing leading closed-source and open-source LLMs, including GPT-4o and DeepSeek-V3, despite its smaller size. We release all code, datasets, and models to support further research.
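To make the four-part sample described in the abstract concrete, here is a minimal sketch in Python. It is an illustration only: the class name, field names, and the toy `employees` table are assumptions invented for this example and are not taken from SynSQL-2.5M, whose released format may differ. The sketch also runs the SQL against an in-memory SQLite database, a simple way to check that a sample's query is executable.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class TextToSQLSample:
    """Hypothetical container for one sample; field names are assumptions."""
    schema_ddl: str  # the database, given here as CREATE TABLE statements
    question: str    # natural language question
    sql: str         # SQL query that answers the question
    cot: str         # chain-of-thought (CoT) solution text


def execute_sample(sample, rows_sql=""):
    """Build the database in memory, optionally load rows, and run the SQL."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(sample.schema_ddl)
        if rows_sql:
            conn.executescript(rows_sql)
        return conn.execute(sample.sql).fetchall()
    finally:
        conn.close()


# Toy sample, invented for illustration (not from the dataset).
sample = TextToSQLSample(
    schema_ddl=(
        "CREATE TABLE employees ("
        "id INTEGER PRIMARY KEY, name TEXT, salary REAL);"
    ),
    question="How many employees earn more than 50000?",
    sql="SELECT COUNT(*) FROM employees WHERE salary > 50000;",
    cot="Filter employees by salary > 50000, then count the remaining rows.",
)

rows = execute_sample(
    sample,
    rows_sql=(
        "INSERT INTO employees (name, salary) "
        "VALUES ('Ann', 60000), ('Bob', 40000);"
    ),
)
print(rows)  # [(1,)]
```

Keeping the schema as plain DDL text means a sample can be rebuilt and executed anywhere SQLite is available; a real pipeline would likely store populated database files instead.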
Problem

Research questions and friction points this paper is trying to address.

Reducing reliance on closed-source LLMs, which are expensive, raise privacy concerns, and lack customization
Addressing poor generalizability in fine-tuning-based methods
Providing scalable, high-quality text-to-SQL datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scalable text-to-SQL data synthesis framework
SynSQL-2.5M: Million-scale text-to-SQL dataset
OmniSQL: Open-source text-to-SQL model
Authors

Haoyang Li
School of Information, Renmin University of China, Beijing, China; Key Laboratory of Data Engineering and Knowledge Engineering, MOE, China; ByteDance Inc

Shang Wu
Unknown affiliation

Xiaokang Zhang
School of Information, Renmin University of China, Beijing, China; Key Laboratory of Data Engineering and Knowledge Engineering, MOE, China

Xinmei Huang
Renmin University of China

Jing Zhang
Engineering Research Center of Database and Business Intelligence, MOE, China; School of Information, Renmin University of China, Beijing, China

Fuxin Jiang
ByteDance
TimeSeries Forecasting, Resource Scheduling, LLM

Shuai Wang
ByteDance Inc

Tieying Zhang
Research Scientist at ByteDance
AI for Systems, Systems for AI

Jianjun Chen
ByteDance Inc

Rui Shi
ByteDance, Inc.
Database Systems, Big Data, Distributed Systems, Cloud Native, Programming Languages

Hong Chen
Engineering Research Center of Database and Business Intelligence, MOE, China; School of Information, Renmin University of China, Beijing, China

Cuiping Li
Renmin University of China
Database, big data analysis and mining