TCSR-SQL: Towards Table Content-aware Text-to-SQL with Self-retrieval

📅 2024-07-01
🏛️ CAAI Transactions on Intelligence Technology
📈 Citations: 2
Influential: 0

🤖 AI Summary
In real-world Text-to-SQL tasks, natural language questions frequently contain ambiguous data content keywords or refer to column names that do not exist in the database schema, which significantly degrades the performance of existing large language model (LLM)-based approaches. To address this, the authors propose TCSR-SQL, a table-content-aware Text-to-SQL framework with self-retrieval. It uses LLM in-context learning to extract data content keywords from the question and infer the related database schema, runs fuzzy searches against the database to confirm the actual column names and stored values, and refines the SQL query through iterative generate-execute-revise cycles. Evaluated on a new benchmark of 2,115 table-content-aware question-SQL pairs, the method improves execution accuracy by at least 27.8% compared to other state-of-the-art methods. The work targets two core challenges in content-driven SQL generation: keyword ambiguity and schema mismatch.

📝 Abstract
Large language model-based (LLM-based) text-to-SQL methods have achieved important progress in generating SQL queries for real-world applications. When existing methods are confronted with table content-aware questions in real-world scenarios, ambiguous data content keywords and nonexistent database schema column names within the question lead to poor performance. To solve this problem, we propose a novel approach towards table content-aware text-to-SQL with self-retrieval (TCSR-SQL). It leverages the LLM's in-context learning capability to extract data content keywords within the question and infer the possibly related database schema, which is used to generate Seed SQL for fuzzy searches against the database. The search results are further used to confirm the encoding knowledge with the designed encoding knowledge table, including the column names and exact stored content values used in the SQL. The encoding knowledge is then used to obtain the final Precise SQL through a multi-round generation-execution-revision process. To validate our approach, we introduce a table-content-aware, question-related benchmark dataset containing 2115 question-SQL pairs. Comprehensive experiments conducted on this benchmark demonstrate the remarkable performance of TCSR-SQL, achieving an improvement of at least 27.8% in execution accuracy compared to other state-of-the-art methods.
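
To make the Seed SQL idea concrete, here is a minimal sketch of a fuzzy search that maps a question keyword to the exact value stored in a table. It is an illustration only, not the paper's implementation: the `orders` table, its rows, the `fuzzy_search` helper, and the use of SQLite `LIKE` matching are all assumptions made for this example.

```python
import sqlite3

def fuzzy_search(conn, table, columns, keyword):
    """Return (column, stored_value) pairs whose text contains the keyword.

    Hypothetical helper: each probe plays the role of a "Seed SQL" query.
    Table and column names are assumed to be trusted identifiers.
    """
    hits = []
    for col in columns:
        # One LIKE probe per candidate column; the keyword is bound as a parameter.
        seed_sql = f'SELECT DISTINCT "{col}" FROM "{table}" WHERE "{col}" LIKE ?'
        for (value,) in conn.execute(seed_sql, (f"%{keyword}%",)):
            hits.append((col, value))
    return hits

# Toy database (invented data) where the stored value differs from the keyword.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (city TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("Shenzhen City", "SHIPPED"), ("Jinan City", "PENDING"), ("Guangzhou City", "SHIPPED")],
)
conn.commit()

# The question says "shenzhen"; the table stores "Shenzhen City" in column `city`.
matches = fuzzy_search(conn, "orders", ["city", "status"], "shenzhen")
print(matches)
```

The returned pairs are the kind of information the abstract calls encoding knowledge: which column holds the keyword and what the exact stored value is, so that the final query can filter on the stored value instead of the user's wording.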
Problem

Research questions and friction points this paper is trying to address.

Resolving ambiguous data content keywords in Text-to-SQL
Handling nonexistent database schema column names in questions
Improving SQL generation accuracy for table content-aware scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses self-retrieval to extract data content keywords
Generates seed SQL for fuzzy database searching
Employs multi-round generation-execution-revision for precise SQL
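
The multi-round generation-execution-revision idea can be sketched as a simple loop that feeds execution feedback back to the generator. This is a hedged illustration, not the authors' code: `refine_sql`, `stub_generator`, the toy `orders` table, and the rule that an empty result triggers another revision are assumptions, and the stub stands in for what would be an LLM call.

```python
import sqlite3
from typing import Callable, List, Optional, Tuple

def refine_sql(
    conn: sqlite3.Connection,
    generate: Callable[[Optional[str]], str],
    max_rounds: int = 3,
) -> Tuple[Optional[str], List[tuple]]:
    """Generate SQL, execute it, and pass any failure back as feedback."""
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        sql = generate(feedback)
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            feedback = f"execution error: {exc}"
            continue
        if rows:
            return sql, rows
        # Assumption for this sketch: an empty result is treated as a failure.
        feedback = "query ran but returned no rows"
    return None, []

def stub_generator(feedback: Optional[str]) -> str:
    """Hypothetical stand-in for an LLM: revises the query based on feedback."""
    if feedback is None:
        return "SELECT status FROM orders WHERE town = 'Shenzhen'"  # nonexistent column
    if feedback.startswith("execution error"):
        return "SELECT status FROM orders WHERE city = 'Shenzhen'"  # wrong stored value
    return "SELECT status FROM orders WHERE city = 'Shenzhen City'"  # exact stored value

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (city TEXT, status TEXT)")
conn.execute("INSERT INTO orders VALUES ('Shenzhen City', 'SHIPPED')")
conn.commit()

final_sql, rows = refine_sql(conn, stub_generator)
print(final_sql, rows)
```

In this toy run the first attempt fails on a nonexistent column, the second runs but matches nothing, and the third succeeds, mirroring the two failure modes the summary describes (schema mismatch and ambiguous content keywords).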
Authors
Wenbo Xu
Sun Yat-sen University
Multimodal, Multimedia
Liang Yan
Department of Computer Science, Harbin Institute of Technology (Shenzhen), Shenzhen, China | Inspur Cloud Information Technology Co. Ltd, Jinan, China
Peiyi Han
Department of Computer Science, Harbin Institute of Technology (Shenzhen), Shenzhen, China | PengCheng Laboratory, Shenzhen, China
Haifeng Zhu
Department of Computer Science, Harbin Institute of Technology (Shenzhen), Shenzhen, China
Chuanyi Liu
Pengcheng Laboratory, Harbin Institute of Technology, Shenzhen
Cloud Computing, Cloud Security, Privacy Enhanced Technologies
Shaoming Duan
Cuiyun Gao
Yingwei Liang
Guangdong Power Grid Co. Ltd, Guangzhou, China