Retrieval and Augmentation of Domain Knowledge for Text-to-SQL Semantic Parsing

📅 2025-10-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Existing text-to-SQL methods vary widely in accuracy when generalizing across databases. The main cause is the lack of systematic alignment between the domain semantics embedded in natural language queries and the structural patterns of database schemas. Manual prompt engineering for injecting domain knowledge compounds the problem because it is inefficient and does not generalize. Method: We propose a multi-database text-to-SQL framework that models domain knowledge as retrievable, structured statements attached at the database level, retrieves the statements relevant to a query with lightweight sub-string matching, and integrates the retrieved knowledge into the LLM's input, removing the reliance on handcrafted, query-specific hints. Contribution/Results: Evaluated on 11 realistic database schemas and 5 open-source and proprietary LLMs, database-level structured statements prove more practical and accurate than ad-hoc, query-specific textual hints, and sub-string-match retrieval yields significantly higher accuracy than other retrieval approaches.

📝 Abstract

The performance of Large Language Models (LLMs) for translating Natural Language (NL) queries into SQL varies significantly across databases (DBs). NL queries are often expressed using a domain-specific vocabulary, and mapping these to the correct SQL requires an understanding of the embedded domain expressions and their relationship to the DB schema structure. Existing benchmarks rely on unrealistic, ad-hoc, query-specific textual hints for expressing domain knowledge. In this paper, we propose a systematic framework for associating structured domain statements at the database level. We present retrieval of relevant structured domain statements given a user query using sub-string-level matching. We evaluate on eleven realistic DB schemas covering diverse domains across five open-source and proprietary LLMs and demonstrate that (1) DB-level structured domain statements are more practical and accurate than existing ad-hoc, query-specific textual domain statements, and (2) our sub-string-match-based retrieval of relevant domain statements provides significantly higher accuracy than other retrieval approaches.
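To make the retrieval idea concrete, here is a minimal Python sketch of sub-string matching over database-level domain statements. It is an illustration under stated assumptions, not the paper's implementation: the `DomainStatement` fields, the `retrieve` function, and the example statements are all hypothetical, and the abstract does not specify how statements are structured or how matches are normalized.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainStatement:
    """Hypothetical structured domain statement attached to one database."""
    term: str        # domain expression as it may appear in a user query
    schema_ref: str  # schema element(s) the term relates to (assumed field)
    definition: str  # how the term maps onto the schema (assumed field)


def retrieve(query: str, statements: list[DomainStatement]) -> list[DomainStatement]:
    """Return statements whose term occurs as a case-insensitive substring of the query."""
    # Lowercase and collapse whitespace so spacing differences do not block a match.
    normalized = " ".join(query.lower().split())
    return [s for s in statements if s.term.lower() in normalized]


# Illustrative statements for an imaginary sales database.
STATEMENTS = [
    DomainStatement("active customer", "customers.last_order_date",
                    "last_order_date within the past 90 days"),
    DomainStatement("churn", "customers.status", "status = 'cancelled'"),
    DomainStatement("net revenue", "orders.amount, orders.refund",
                    "SUM(amount) - SUM(refund)"),
]

hits = retrieve("How many Active  Customers generated net revenue last month?", STATEMENTS)
print([s.term for s in hits])  # ['active customer', 'net revenue']
```

Plain substring containment needs no embedding model or index, which is what makes this kind of retrieval lightweight. Note that in this sketch the plural "customers" still matches because it contains the term "active customer"; a stricter or fuzzier matching rule would be a separate design decision.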

Problem

Research questions and friction points this paper is trying to address.

Mapping domain-specific vocabulary to SQL queries

Improving LLM performance across diverse databases

Retrieving structured domain knowledge for semantic parsing

Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematic framework for structured domain statements

Sub-string match retrieval of domain knowledge

Database-level augmentation for text-to-SQL parsing
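As a rough illustration of database-level augmentation, the following Python sketch assembles an LLM prompt from a schema, a set of already-retrieved domain notes, and the user question. The function name `build_prompt`, the prompt layout, and the example schema and note are assumptions made for illustration; the source does not describe the prompt format the authors use.

```python
def build_prompt(question: str, schema_ddl: str, domain_notes: list[str]) -> str:
    """Assemble a text-to-SQL prompt: schema, then domain notes, then the question."""
    parts = ["-- Database schema", schema_ddl.strip()]
    if domain_notes:
        # Only add the section when retrieval returned something for this query.
        parts.append("-- Domain knowledge relevant to this question")
        parts.extend(f"-- * {note}" for note in domain_notes)
    parts.append(f"-- Question: {question.strip()}")
    parts.append("-- SQL:")
    return "\n".join(parts)


# Hypothetical schema and note for an imaginary sales database.
SCHEMA = "CREATE TABLE orders (id INT, amount REAL, refund REAL, placed_on DATE);"
NOTES = ["'net revenue' means SUM(orders.amount) - SUM(orders.refund)"]

prompt = build_prompt("What was net revenue in 2024?", SCHEMA, NOTES)
print(prompt)
```

Because the notes are stored per database rather than written per query, the same assembly step can be reused for any question against that database; only the retrieved subset changes.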
