CRED-SQL: Enhancing Real-world Large Scale Database Text-to-SQL Parsing through Cluster Retrieval and Execution Description

📅 2025-08-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper proposes CRED-SQL, a Text-to-SQL framework that targets the semantic misalignment between natural language questions (NLQs) and SQL in large-scale databases, a mismatch that complicates schema linking and causes drift during SQL generation. It introduces Execution Description Language (EDL) as an interpretable natural language intermediate representation and decomposes Text-to-SQL into two sequential stages: Text-to-EDL and EDL-to-SQL. It also designs a clustering-driven schema retrieval mechanism to improve semantic alignment between NLQs and schema elements, and integrates LLM-powered NL rewriting with EDL generation. Evaluated on the SpiderUnion and BirdUnion benchmarks, CRED-SQL achieves new state-of-the-art performance, which the authors present as evidence of its effectiveness and scalability.

📝 Abstract

Recent advances in large language models (LLMs) have significantly improved the accuracy of Text-to-SQL systems. However, a critical challenge remains: the semantic mismatch between natural language questions (NLQs) and their corresponding SQL queries. This issue is exacerbated in large-scale databases, where semantically similar attributes hinder schema linking and cause semantic drift during SQL generation, ultimately reducing model accuracy. To address these challenges, we introduce CRED-SQL, a framework designed for large-scale databases that integrates Cluster Retrieval and Execution Description. CRED-SQL first performs cluster-based large-scale schema retrieval to pinpoint the tables and columns most relevant to a given NLQ, alleviating schema mismatch. It then introduces an intermediate natural language representation, Execution Description Language (EDL), to bridge the gap between NLQs and SQL. This reformulation decomposes the task into two stages, Text-to-EDL and EDL-to-SQL, leveraging LLMs' strong general reasoning capabilities while reducing semantic deviation. Extensive experiments on two large-scale, cross-domain benchmarks, SpiderUnion and BirdUnion, demonstrate that CRED-SQL achieves new state-of-the-art (SOTA) performance, validating its effectiveness and scalability. Our code is available at https://github.com/smduan/CRED-SQL.git
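
The cluster-based retrieval step described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the token-overlap (Jaccard) similarity, the greedy clustering rule, the 0.2 threshold, and every function name and the example schema are assumptions chosen to keep the example dependency-free. A real system would more plausibly use learned embeddings. The sketch only shows the general shape: group similar tables into clusters, pick the cluster closest to the question, then rank tables inside it.

```python
import re

def tokens(text):
    """Return the set of lowercase alphanumeric tokens in a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 if either is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def table_tokens(name, columns):
    """Tokens drawn from a table name and its column names."""
    return tokens(name + " " + " ".join(columns))

def cluster_tables(schema, threshold=0.2):
    """Greedy single-pass clustering (an assumed, simplistic rule):
    a table joins the first cluster whose pooled tokens are similar
    enough, otherwise it starts a new cluster."""
    clusters = []
    for name, columns in schema.items():
        toks = table_tokens(name, columns)
        for cluster in clusters:
            if jaccard(toks, cluster["tokens"]) >= threshold:
                cluster["tables"].append(name)
                cluster["tokens"] |= toks
                break
        else:
            clusters.append({"tables": [name], "tokens": set(toks)})
    return clusters

def retrieve(question, schema, clusters, top_k=2):
    """Pick the cluster closest to the question, then rank its tables."""
    q = tokens(question)
    best = max(clusters, key=lambda c: jaccard(q, c["tokens"]))
    ranked = sorted(
        best["tables"],
        key=lambda t: jaccard(q, table_tokens(t, schema[t])),
        reverse=True,
    )
    return ranked[:top_k]

# Hypothetical schema merged from two unrelated domains.
schema = {
    "singer": ["singer_id", "name", "country", "age"],
    "concert": ["concert_id", "concert_name", "singer_id", "year"],
    "airport": ["airport_code", "airport_name", "city"],
    "flight": ["flight_no", "airport_code", "airline"],
}
clusters = cluster_tables(schema)
hits = retrieve(
    "Which airline operates flights from each airport city?",
    schema,
    clusters,
)
print(hits)
```

Restricting the ranking to one cluster is the point of the design: tables from the unrelated music domain never compete with the aviation tables, even if they share generic column names such as `name` or `id`.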

Problem

Research questions and friction points this paper is trying to address.

Addresses semantic mismatch in Text-to-SQL for large databases

Improves schema linking via cluster retrieval for accurate SQL generation

Introduces Execution Description Language to reduce semantic deviation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Cluster-based schema retrieval for relevant tables

Execution Description Language bridges NLQ-SQL gap

Two-stage Text-to-EDL and EDL-to-SQL decomposition
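
The two-stage decomposition listed above can be sketched as a small pipeline. This is only an illustration under stated assumptions: the prompts, the function names, the `LLM` callable type, and the stubbed model are all hypothetical, and the step-by-step text returned by the stub is an invented placeholder rather than the paper's actual EDL syntax.

```python
from typing import Callable, Tuple

# Hypothetical interface: any function mapping a prompt to a completion.
LLM = Callable[[str], str]

def text_to_edl(question: str, schema_text: str, llm: LLM) -> str:
    """Stage 1: ask the model for a natural language execution description."""
    prompt = (
        "Stage 1: describe, step by step, how to answer the question "
        "using the schema.\n"
        f"Schema: {schema_text}\nQuestion: {question}"
    )
    return llm(prompt).strip()

def edl_to_sql(edl: str, schema_text: str, llm: LLM) -> str:
    """Stage 2: ask the model to translate the description into SQL."""
    prompt = (
        "Stage 2: translate the execution description into one SQL query.\n"
        f"Schema: {schema_text}\nDescription: {edl}"
    )
    return llm(prompt).strip()

def text_to_sql(question: str, schema_text: str, llm: LLM) -> Tuple[str, str]:
    """Run both stages and return the intermediate description and the SQL."""
    edl = text_to_edl(question, schema_text, llm)
    sql = edl_to_sql(edl, schema_text, llm)
    return edl, sql

def fake_llm(prompt: str) -> str:
    """Stand-in for a real model so the sketch runs offline.
    The outputs are canned and purely illustrative."""
    if prompt.startswith("Stage 1"):
        return (
            "Step 1: read table singer. "
            "Step 2: keep rows where country is 'France'. "
            "Step 3: count the remaining rows."
        )
    return "SELECT COUNT(*) FROM singer WHERE country = 'France';"

edl, sql = text_to_sql(
    "How many singers are from France?",
    "singer(singer_id, name, country, age)",
    fake_llm,
)
print(edl)
print(sql)
```

Returning the intermediate description alongside the SQL keeps the pipeline inspectable: a reader can check whether an error arose while describing the plan or while translating it.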

🔎 Similar Papers

A Survey on Employing Large Language Models for Text-to-SQL Tasks