XiYan-SQL: A Novel Multi-Generator Framework For Text-to-SQL

📅 2025-07-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address insufficient semantic parsing accuracy in Text-to-SQL tasks, this paper proposes XiYan-SQL, a multi-generator collaborative framework. First, a Schema Filter module filters the database schema and produces multiple relevant schemas. Second, multiple SQL generators, trained with a multi-task fine-tuning strategy for text-SQL alignment and made diverse by fine-tuning on different SQL formats, produce varied, high-quality candidates. Finally, a lightweight selection model with a structured candidate reorganization strategy picks the final SQL query. The method reports 75.63% accuracy on BIRD, surpassing all previous methods, and 89.65% on the Spider test set, both state-of-the-art. Key contributions include: (1) multi-format fine-tuning for controllable generation diversity; (2) a structured candidate reorganization mechanism enforcing syntactic and semantic coherence; and (3) a low-overhead selection optimization paradigm that avoids costly end-to-end retraining while preserving performance.

📝 Abstract
To leverage the advantages of LLMs in addressing the challenges of the Text-to-SQL task, we present XiYan-SQL, an innovative framework that effectively generates and utilizes multiple SQL candidates. It consists of three components: 1) a Schema Filter module that filters the schema and obtains multiple relevant schemas; 2) a multi-generator ensemble approach that generates multiple high-quality and diverse SQL queries; 3) a selection model with a candidate reorganization strategy that obtains the optimal SQL query. Specifically, for the multi-generator ensemble, we employ a multi-task fine-tuning strategy to enhance the capabilities of SQL generation models for the intrinsic alignment between SQL and text, and we construct multiple generation models with distinct generation styles by fine-tuning across different SQL formats. The experimental results and comprehensive analysis demonstrate the effectiveness and robustness of our framework. Overall, XiYan-SQL achieves a new SOTA performance of 75.63% on the BIRD benchmark, surpassing all previous methods. It also attains SOTA performance on the Spider test set with an accuracy of 89.65%.
Problem

Research questions and friction points this paper is trying to address.

Develop a multi-generator framework for Text-to-SQL conversion
Generate diverse high-quality SQL queries from text input
Select optimal SQL query using reorganization strategy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Schema Filter module for relevant schemas
Multi-generator ensemble for diverse SQL queries
Selection model with candidate reorganization strategy
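
The three components listed above can be pictured as a filter, generate, select pipeline. The following is only a minimal sketch under loud assumptions, not the paper's implementation: `filter_schema`, `generate_candidates`, `select_candidate`, and the three toy generators are hypothetical names, keyword matching stands in for the Schema Filter module, fixed strings stand in for fine-tuned generators, and a majority vote stands in for the learned selection model and its candidate reorganization strategy.

```python
from collections import Counter
from typing import Callable, Dict, List

Schema = Dict[str, List[str]]           # table name -> column names
Generator = Callable[[str, Schema], str]


def filter_schema(question: str, schema: Schema) -> Schema:
    """Toy schema filter (assumption): keep tables mentioned in the question."""
    words = set(question.lower().replace("?", "").split())
    kept = {
        table: cols
        for table, cols in schema.items()
        if table.lower() in words or any(c.lower() in words for c in cols)
    }
    return kept or schema  # fall back to the full schema if nothing matched


def generate_candidates(question: str, schema: Schema,
                        generators: List[Generator]) -> List[str]:
    """Ask every generator for one SQL candidate."""
    return [generate(question, schema) for generate in generators]


def select_candidate(candidates: List[str]) -> str:
    """Toy selector (assumption): majority vote over normalized SQL text."""
    def normalize(sql: str) -> str:
        return " ".join(sql.lower().split())

    counts = Counter(normalize(c) for c in candidates)
    winner, _ = counts.most_common(1)[0]
    # Return the first original candidate that matches the winning form.
    return next(c for c in candidates if normalize(c) == winner)


# Hypothetical generators with different output styles (fixed strings here).
def gen_upper(question: str, schema: Schema) -> str:
    return "SELECT name FROM singer WHERE age > 30"


def gen_lower(question: str, schema: Schema) -> str:
    return "select name from singer where age > 30"


def gen_loose(question: str, schema: Schema) -> str:
    return "SELECT name FROM singer"


def run_pipeline(question: str, schema: Schema,
                 generators: List[Generator]) -> str:
    relevant = filter_schema(question, schema)
    candidates = generate_candidates(question, relevant, generators)
    return select_candidate(candidates)


example_schema = {"singer": ["name", "age"], "concert": ["venue", "year"]}
example_question = "Which singer is older than 30?"
best_sql = run_pipeline(example_question, example_schema,
                        [gen_upper, gen_lower, gen_loose])
print(best_sql)
```

In this toy run the two generators that agree after normalization outvote the third. A learned selector, as the paper describes, would replace the vote with a model that scores the reorganized candidates.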
Authors

Yifu Liu
Alibaba Cloud Computing Co., Ltd., Hangzhou, Zhejiang, China

Yin Zhu
Alibaba Cloud Computing Co., Ltd., Hangzhou, Zhejiang, China

Yingqi Gao
Alibaba Cloud Computing Co., Ltd., Hangzhou, Zhejiang, China

Zhiling Luo
Alibaba Inc.
NLP, LLM, Machine Learning

Xiaoxia Li
Alibaba Cloud Computing Co., Ltd., Hangzhou, Zhejiang, China

Xiaorong Shi
Alibaba Cloud Computing Co., Ltd., Hangzhou, Zhejiang, China

Yuntao Hong
Alibaba Cloud Computing Co., Ltd., Hangzhou, Zhejiang, China

Jinyang Gao
Alibaba Group
Machine Learning, Learning Systems

Yu Li
Alibaba Cloud Computing Co., Ltd., Hangzhou, Zhejiang, China

Bolin Ding
Alibaba Group
Databases, Data Privacy, Machine Learning

Jingren Zhou
Alibaba Group, Microsoft
Cloud Computing, Large Scale Distributed Systems, Machine Learning, Query Processing, Query