Automatic database description generation for Text-to-SQL

📅 2025-02-28

🤖 AI Summary

To bridge the semantic gap in Text-to-SQL that arises when database schemas lack natural language descriptions, this paper proposes a dual-process method for generating table and column descriptions automatically. A coarse-to-fine pass moves from the database to its tables and then to their columns to capture global structure, and a subsequent fine-to-coarse pass starts from the columns and steps back to the tables for local semantic precision. The approach is built on large language model prompting. On the Bird benchmark, using the generated descriptions improves SQL generation accuracy by 0.93% compared to using no descriptions and achieves 37% of human-level performance. The source code is publicly available.

📝 Abstract

In the context of the Text-to-SQL task, table and column descriptions are crucial for bridging the gap between natural language and database schema. This report proposes a method for automatically generating effective database descriptions when explicit descriptions are unavailable. The proposed method employs a dual-process approach: a coarse-to-fine process, followed by a fine-to-coarse process. The coarse-to-fine approach leverages the inherent knowledge of the LLM to guide the understanding process from databases to tables and finally to columns. This approach provides a holistic understanding of the database structure and ensures contextual alignment. Conversely, the fine-to-coarse approach starts at the column level, offering a more accurate and nuanced understanding when stepping back to the table level. Experimental results on the Bird benchmark indicate that using descriptions generated by the proposed method improves SQL generation accuracy by 0.93% compared to not using descriptions, and achieves 37% of human-level performance. The source code is publicly available at https://github.com/XGenerationLab/XiYan-DBDescGen.
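The two passes described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `LLM` callable, the prompt wording, the function names, and the choice to feed the coarse-to-fine draft into the fine-to-coarse pass are all assumptions made for the example.

```python
from typing import Callable, Dict, List

Schema = Dict[str, List[str]]   # table name -> column names
LLM = Callable[[str], str]      # hypothetical interface: prompt in, text out


def coarse_to_fine(db_name: str, schema: Schema, llm: LLM) -> Dict[str, dict]:
    """Database -> tables -> columns; each level sees the summary above it."""
    db_summary = llm(f"Summarize database '{db_name}' with tables: {', '.join(schema)}")
    draft: Dict[str, dict] = {}
    for table, columns in schema.items():
        table_summary = llm(
            f"Context: {db_summary}\n"
            f"Describe table '{table}' with columns: {', '.join(columns)}"
        )
        column_desc = {
            col: llm(f"Context: {table_summary}\nDescribe column '{table}.{col}'")
            for col in columns
        }
        draft[table] = {"table": table_summary, "columns": column_desc}
    return draft


def fine_to_coarse(schema: Schema, draft: Dict[str, dict], llm: LLM) -> Dict[str, dict]:
    """Columns first, then step back and describe the table from its columns."""
    final: Dict[str, dict] = {}
    for table, columns in schema.items():
        # Assumption: the earlier draft is passed along as context for refinement.
        column_desc = {
            col: llm(
                f"Draft: {draft[table]['columns'][col]}\n"
                f"Refine the description of column '{table}.{col}'"
            )
            for col in columns
        }
        joined = "; ".join(f"{col}: {desc}" for col, desc in column_desc.items())
        table_desc = llm(f"Columns: {joined}\nDescribe table '{table}'")
        final[table] = {"table": table_desc, "columns": column_desc}
    return final


def generate_descriptions(db_name: str, schema: Schema, llm: LLM) -> Dict[str, dict]:
    """Run the coarse-to-fine pass, followed by the fine-to-coarse pass."""
    return fine_to_coarse(schema, coarse_to_fine(db_name, schema, llm), llm)
```

Any function that maps a prompt string to a response string can be passed as `llm`, so the sketch can be exercised with a stub before wiring in a real model client.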

Problem

Research questions and friction points this paper is trying to address.

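Database schemas often lack explicit table and column descriptions, which Text-to-SQL needs to bridge natural language and the schema.

Accurate schema understanding requires both a holistic view of the database and nuanced column-level detail.

Whether automatically generated descriptions help: on Bird, they improve SQL generation accuracy by 0.93% compared to using no descriptions.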

Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-process approach for database description generation

Coarse-to-fine and fine-to-coarse understanding processes

Improves SQL generation accuracy on the Bird benchmark by 0.93% compared to using no descriptions
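For readers unfamiliar with how an accuracy figure of this kind is typically obtained, the sketch below computes a simplified execution accuracy: a predicted query counts as correct when it returns the same rows as the gold query, ignoring row order. It is an assumption-laden illustration using SQLite, not the official Bird evaluation script, and the helper names `run_query` and `execution_accuracy` are made up for the example.

```python
import sqlite3
from typing import List, Optional, Tuple


def run_query(conn: sqlite3.Connection, sql: str) -> Optional[list]:
    """Return the result rows in a canonical order, or None if the query fails."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    # Sort by repr so rows with mixed types or NULLs still compare consistently.
    return sorted(rows, key=repr)


def execution_accuracy(conn: sqlite3.Connection,
                       pairs: List[Tuple[str, str]]) -> float:
    """pairs holds (predicted_sql, gold_sql); returns the percentage that match."""
    if not pairs:
        return 0.0
    hits = 0
    for predicted_sql, gold_sql in pairs:
        gold_rows = run_query(conn, gold_sql)
        predicted_rows = run_query(conn, predicted_sql)
        if gold_rows is not None and predicted_rows == gold_rows:
            hits += 1
    return 100.0 * hits / len(pairs)
```

Running the same question set once with generated descriptions in the prompt and once without, then subtracting the two scores, gives a gain comparable in kind to the one reported above.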
