CRED-SQL: Enhancing Real-world Large Scale Database Text-to-SQL Parsing through Cluster Retrieval and Execution Description

📅 2025-08-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the semantic mismatch between natural language questions (NLQs) and SQL in large-scale databases, which complicates schema linking and causes semantic drift during generation, this paper proposes CRED-SQL, a Text-to-SQL framework. It introduces Execution Description Language (EDL), an interpretable natural-language intermediate representation, and decomposes Text-to-SQL into two sequential stages: Text-to-EDL and EDL-to-SQL. It also designs a cluster-based schema retrieval mechanism that narrows a large schema down to the tables and columns most relevant to the NLQ, and integrates LLM-powered NL rewriting with EDL generation. Evaluated on the SpiderUnion and BirdUnion benchmarks, CRED-SQL reports new state-of-the-art performance, which the authors take as evidence of its effectiveness and scalability.

📝 Abstract
Recent advances in large language models (LLMs) have significantly improved the accuracy of Text-to-SQL systems. However, a critical challenge remains: the semantic mismatch between natural language questions (NLQs) and their corresponding SQL queries. This issue is exacerbated in large-scale databases, where semantically similar attributes hinder schema linking and cause semantic drift during SQL generation, ultimately reducing model accuracy. To address these challenges, we introduce CRED-SQL, a framework designed for large-scale databases that integrates Cluster Retrieval and Execution Description. CRED-SQL first performs cluster-based large-scale schema retrieval to pinpoint the tables and columns most relevant to a given NLQ, alleviating schema mismatch. It then introduces an intermediate natural language representation, Execution Description Language (EDL), to bridge the gap between NLQs and SQL. This reformulation decomposes the task into two stages, Text-to-EDL and EDL-to-SQL, leveraging LLMs' strong general reasoning capabilities while reducing semantic deviation. Extensive experiments on two large-scale, cross-domain benchmarks, SpiderUnion and BirdUnion, demonstrate that CRED-SQL achieves new state-of-the-art (SOTA) performance, validating its effectiveness and scalability. Our code is available at https://github.com/smduan/CRED-SQL.git
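To make the idea of cluster-based schema retrieval concrete, here is a minimal sketch. It is not the authors' algorithm: the abstract does not specify how clusters are formed or scored, so this example assumes a greedy token-overlap clustering and Jaccard similarity in place of whatever representation the paper uses. The names `cluster_columns` and `retrieve`, the `threshold` parameter, and the toy schema are all hypothetical.

```python
import re
from typing import List, Set


def tokens(text: str) -> Set[str]:
    """Lower-case a string and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two token sets (0.0 if either is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_columns(columns: List[str], threshold: float = 0.3) -> List[List[str]]:
    """Greedy one-pass clustering (an assumption, not the paper's method).

    A column joins the first cluster whose first member is similar enough;
    otherwise it starts a new cluster.
    """
    clusters: List[List[str]] = []
    for col in columns:
        for cluster in clusters:
            if jaccard(tokens(col), tokens(cluster[0])) >= threshold:
                cluster.append(col)
                break
        else:
            clusters.append([col])
    return clusters


def retrieve(question: str, columns: List[str], top_clusters: int = 1) -> List[str]:
    """Pick the best-matching cluster(s) first, then rank columns inside them."""
    q = tokens(question)
    clusters = cluster_columns(columns)
    # Score each cluster by its best-matching member.
    ranked = sorted(
        clusters,
        key=lambda c: max(jaccard(q, tokens(col)) for col in c),
        reverse=True,
    )
    selected = [col for c in ranked[:top_clusters] for col in c]
    return sorted(selected, key=lambda col: jaccard(q, tokens(col)), reverse=True)


# Toy schema, invented for illustration.
SCHEMA = ["singer.name", "singer.age", "stadium.capacity", "stadium.location"]
RESULT = retrieve("What is the capacity of each stadium?", SCHEMA)
```

Matching against clusters before individual columns is one way to keep near-duplicate attributes from crowding out the relevant part of a large schema; a realistic system would likely use learned embeddings rather than token overlap.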
Problem

Research questions and friction points this paper is trying to address.

Addresses semantic mismatch in Text-to-SQL for large databases
Improves schema linking via cluster retrieval for accurate SQL generation
Introduces Execution Description Language to reduce semantic deviation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cluster-based schema retrieval for relevant tables
Execution Description Language bridges NLQ-SQL gap
Two-stage Text-to-EDL and EDL-to-SQL decomposition
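The two-stage decomposition listed above can be sketched as a pair of chained model calls. This is an illustration under stated assumptions, not the released implementation: the prompts, the function names (`text_to_edl`, `edl_to_sql`, `text_to_sql`), and the `stub_llm` stand-in are hypothetical, and the EDL text shown is invented because the summary does not give the EDL format.

```python
from typing import Callable, List

# Hypothetical interface: any function that maps a prompt to a completion.
LLM = Callable[[str], str]


def text_to_edl(question: str, schema: List[str], llm: LLM) -> str:
    """Stage 1: ask the model for a natural-language execution description."""
    prompt = (
        "Schema:\n" + "\n".join(schema) + "\n"
        "Question: " + question + "\n"
        "Describe step by step how to compute the answer."
    )
    return llm(prompt)


def edl_to_sql(edl: str, schema: List[str], llm: LLM) -> str:
    """Stage 2: ask the model to turn the description into SQL."""
    prompt = (
        "Schema:\n" + "\n".join(schema) + "\n"
        "Execution description:\n" + edl + "\n"
        "Write the SQL query."
    )
    return llm(prompt)


def text_to_sql(question: str, schema: List[str], llm: LLM) -> str:
    """Chain the two stages; the EDL is the only thing stage 2 sees of the question."""
    edl = text_to_edl(question, schema, llm)
    return edl_to_sql(edl, schema, llm)


def stub_llm(prompt: str) -> str:
    """Canned responses so the sketch runs without a real model."""
    if "Execution description:" in prompt:
        return "SELECT COUNT(*) FROM singer WHERE age > 30"
    return "Count the rows of table singer where column age is greater than 30."


SCHEMA = ["singer(id, name, age)"]
SQL = text_to_sql("How many singers are older than 30?", SCHEMA, stub_llm)
```

Keeping the intermediate description as a plain string makes it easy to log and inspect between stages, which is where the interpretability benefit of an intermediate representation would show up in practice.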
Shaoming Duan
Harbin Institute of Technology, Shenzhen, China; Pengcheng Laboratory, Shenzhen, China
Zirui Wang
Harbin Institute of Technology, Shenzhen, China
Chuanyi Liu
Pengcheng Laboratory, Harbin Institute of Technology, Shenzhen
Cloud Computing, Cloud Security, Privacy Enhanced Technologies
Zhibin Zhu
Harbin Institute of Technology, Shenzhen, China
Yuhao Zhang
Harbin Institute of Technology, Shenzhen, China; Mindflow.ai
Peiyi Han
Harbin Institute of Technology, Shenzhen, China; Pengcheng Laboratory, Shenzhen, China
Liang Yan
Harbin Institute of Technology, Shenzhen, China; Inspur Cloud Information Technology Co., Ltd, Jinan 250101, China
Zewu Peng
Guangdong Power Grid Co., Ltd, China