Knowledge Base Construction for Knowledge-Augmented Text-to-SQL

📅 2025-05-28
🤖 AI Summary
Text-to-SQL faces challenges in domain knowledge coverage and in generalizing to unseen databases and diverse queries. To address this, we propose a universal, transferable paradigm for constructing structured knowledge bases for Text-to-SQL that integrates database schemas, natural language questions, and semantically grounded associations between them. Our approach automatically extracts and organizes this knowledge within a retrieval-augmented generation (RAG) framework, which enables zero-shot adaptation to novel databases and removes the reliance on manual annotation or on knowledge generated once per query. Evaluated on multiple standard benchmarks, including cross-domain and non-overlapping database settings, our method achieves an average improvement of 8.2%+ in SQL execution accuracy over state-of-the-art baselines. This work establishes a reusable, cross-database, cross-domain knowledge infrastructure for Text-to-SQL, advancing robustness and scalability beyond task-specific fine-tuning.

📝 Abstract
Text-to-SQL aims to translate natural language queries into SQL statements, which is practical because it lets anyone retrieve the information they need from databases. Many recent approaches tackle this problem with Large Language Models (LLMs), leveraging their strong capability in understanding user queries and generating the corresponding SQL code. Yet the parametric knowledge in LLMs may not cover all the diverse, domain-specific queries that require grounding in various database schemas, which often makes the generated SQL less accurate. To tackle this, we propose constructing a knowledge base for text-to-SQL, a foundational source of knowledge from which we retrieve and generate the knowledge necessary for a given query. In particular, unlike existing approaches that either manually annotate knowledge or generate only a few pieces of knowledge for each query, our knowledge base is comprehensive: it is constructed from all the available questions, their associated database schemas, and their relevant knowledge, and it can be reused for unseen databases from different datasets and domains. We validate our approach on multiple text-to-SQL datasets, considering both the overlapping and non-overlapping database scenarios, where it substantially outperforms relevant baselines.
Problem

Research questions and friction points this paper is trying to address.

Enhancing Text-to-SQL accuracy with domain-specific knowledge
Overcoming LLM limitations on diverse database queries
Constructing a reusable knowledge base for unseen databases
Innovation

Methods, ideas, or system contributions that make the work stand out.

Constructing a knowledge base for Text-to-SQL
Retrieving necessary knowledge from the base
Reusing knowledge across diverse databases
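
The build, retrieve and reuse idea listed above can be illustrated with a small Python sketch. This is not the paper's implementation: the `KnowledgeEntry` fields, the example entries, the token-overlap (Jaccard) retriever, and the prompt layout are all assumptions chosen to keep the example self-contained. The summary does not specify the actual retriever or knowledge format.

```python
import re
from dataclasses import dataclass


@dataclass
class KnowledgeEntry:
    """One knowledge-base record (hypothetical layout, not from the paper)."""
    question: str   # natural language question seen at construction time
    schema: str     # textual form of the associated database schema
    knowledge: str  # knowledge that helps map the question to SQL


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into word-like tokens."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Token-overlap similarity; a stand-in for a learned retriever."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(kb: list[KnowledgeEntry], question: str, schema: str,
             k: int = 2) -> list[KnowledgeEntry]:
    """Return the k entries whose question and schema best match the query."""
    query = tokenize(question + " " + schema)
    ranked = sorted(
        kb,
        key=lambda e: jaccard(query, tokenize(e.question + " " + e.schema)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, schema: str,
                 entries: list[KnowledgeEntry]) -> str:
    """Assemble an LLM prompt that includes the retrieved knowledge."""
    hints = "\n".join(f"- {e.knowledge}" for e in entries)
    return (
        f"Schema:\n{schema}\n\n"
        f"Relevant knowledge:\n{hints}\n\n"
        f"Question: {question}\nSQL:"
    )


# Toy knowledge base built from (question, schema, knowledge) triples.
EXAMPLE_KB = [
    KnowledgeEntry("How many singers are older than 30?",
                   "singer(id, name, age)",
                   "Age is stored in years in singer.age."),
    KnowledgeEntry("List schools with high SAT scores",
                   "satscores(cds, sname, avgscrmath)",
                   "avgscrmath is the average SAT math score."),
    KnowledgeEntry("What is the total order amount per customer?",
                   "orders(id, customer_id, amount)",
                   "Total amount refers to SUM(orders.amount)."),
]
```

Calling `retrieve(EXAMPLE_KB, "Which customer has the largest total order amount?", "orders(id, customer_id, amount)")` ranks the `orders` entry first, and `build_prompt` then places its knowledge in front of the question. Because the entries are keyed by question and schema text rather than by a specific database, the same store could in principle be queried for a database that was not present when it was built, which is the reuse scenario the paper targets.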