Automatic Metadata Extraction for Text-to-SQL

📅 2025-05-26

🤖 AI Summary

The core challenge in text-to-SQL generation is that high-quality database metadata is rarely available in practice: manual annotation requires expensive subject matter experts and is often incomplete. To address this, the paper explores automatic metadata extraction using three techniques: (1) database profiling, (2) query log analysis, and (3) SQL-to-text generation with an LLM. Because BIRD does not provide query logs for its test database, the benchmark submission relies on profiling alone and uses GPT-4o without any specially tuned model. On the BIRD benchmark, this submission held the highest score from Sept 1 to Sept 23, 2024 and from Nov 11 through Nov 23, 2024, both with and without the "oracle" information supplied with the question set. It regained first place on Mar 11, 2025 and was still ranked first as of May 2025.

📝 Abstract

Large Language Models (LLMs) have recently become sophisticated enough to automate many tasks ranging from pattern finding to writing assistance to code generation. In this paper, we examine text-to-SQL generation. We have observed from decades of experience that the most difficult part of query development lies in understanding the database contents. These experiences inform the direction of our research. Text-to-SQL benchmarks such as SPIDER and BIRD contain extensive metadata that is generally not available in practice. Human-generated metadata requires the use of expensive Subject Matter Experts (SMEs), who are often not fully aware of many aspects of their databases. In this paper, we explore techniques for automatic metadata extraction to enable text-to-SQL generation. We explore the use of two standard metadata extraction techniques and one newer one: profiling, query log analysis, and SQL-to-text generation using an LLM. We use the BIRD benchmark [JHQY+23] to evaluate the effectiveness of these techniques. BIRD does not provide query logs on its test database, so we prepared a submission that uses profiling alone and does not use any specially tuned model (we used GPT-4o). From Sept 1 to Sept 23, 2024, and from Nov 11 through Nov 23, 2024, we achieved the highest score both with and without using the "oracle" information provided with the question set. We regained the number 1 spot on Mar 11, 2025, and are still at #1 at the time of writing (May 2025).
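To make the idea of profiling concrete, here is a minimal sketch of what per-column profiling might look like on a SQLite database (the format BIRD distributes its databases in). It is not the authors' implementation: the helper name `profile_table`, the chosen statistics, and the toy `schools` table are all assumptions made for illustration.

```python
import sqlite3


def profile_table(conn, table, top_k=3):
    """Return simple per-column statistics for one table (hypothetical helper)."""
    cur = conn.cursor()
    row_count = cur.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
    columns = cur.execute(f'PRAGMA table_info("{table}")').fetchall()
    profile = {}
    for _cid, name, decl_type, *_rest in columns:
        non_null, distinct, lo, hi = cur.execute(
            f'SELECT COUNT("{name}"), COUNT(DISTINCT "{name}"), '
            f'MIN("{name}"), MAX("{name}") FROM "{table}"'
        ).fetchone()
        # Most frequent non-NULL values; ties are broken by value for determinism.
        top_values = cur.execute(
            f'SELECT "{name}", COUNT(*) AS n FROM "{table}" '
            f'WHERE "{name}" IS NOT NULL '
            f'GROUP BY "{name}" ORDER BY n DESC, "{name}" LIMIT ?',
            (top_k,),
        ).fetchall()
        profile[name] = {
            "type": decl_type,
            "rows": row_count,
            "nulls": row_count - non_null,
            "distinct": distinct,
            "min": lo,
            "max": hi,
            "top_values": top_values,
        }
    return profile


# Toy data standing in for a real database (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE schools (id INTEGER, county TEXT, enrollment INTEGER)")
conn.executemany(
    "INSERT INTO schools VALUES (?, ?, ?)",
    [(1, "Alameda", 500), (2, "Alameda", 320), (3, "Fresno", None)],
)
stats = profile_table(conn, "schools")
print(stats["county"])
```

Statistics of this kind (null counts, distinct counts, value ranges, frequent values) could then be rendered as text and placed in the LLM prompt in lieu of hand-written column descriptions.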

Problem

Research questions and friction points this paper is trying to address.

Automating metadata extraction for text-to-SQL generation

Reducing reliance on human experts for database understanding

Evaluating profiling and LLM-based techniques for metadata extraction

Innovation

Methods, ideas, or system contributions that make the work stand out.

Automatic metadata extraction for Text-to-SQL

Profiling and query log analysis techniques

SQL-to-text generation using LLM (GPT-4o)
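As a rough illustration of the query log analysis item above, the sketch below counts which column pairs appear in equality predicates across a set of logged queries, a simple way to surface likely join paths. It is an assumption-laden toy, not the paper's method: `mine_join_pairs` is a hypothetical name, the regular expression stands in for a real SQL parser, and it assumes queries qualify columns with full table names rather than aliases.

```python
import re
from collections import Counter

# Matches predicates of the form table.column = table.column.
# Identifiers must start with a letter or underscore, so numeric
# literals such as 10.5 are not mistaken for qualified columns.
JOIN_PRED = re.compile(
    r"([A-Za-z_]\w*)\.([A-Za-z_]\w*)\s*=\s*([A-Za-z_]\w*)\.([A-Za-z_]\w*)"
)


def mine_join_pairs(queries):
    """Count cross-table equality predicates in a query log (hypothetical helper)."""
    counts = Counter()
    for sql in queries:
        for t1, c1, t2, c2 in JOIN_PRED.findall(sql):
            if t1.lower() == t2.lower():
                continue  # same table on both sides: not a join predicate
            # Sort so that a.x = b.y and b.y = a.x count as the same pair.
            left, right = sorted([f"{t1}.{c1}".lower(), f"{t2}.{c2}".lower()])
            counts[(left, right)] += 1
    return counts


# Made-up log entries for illustration only.
log = [
    "SELECT * FROM orders JOIN customers ON orders.customer_id = customers.id",
    "SELECT name FROM customers, orders "
    "WHERE customers.id = orders.customer_id AND orders.total > 10.5",
    "SELECT * FROM orders JOIN items ON orders.id = items.order_id",
]
pairs = mine_join_pairs(log)
print(pairs.most_common())
```

A production version would need a proper SQL parser to resolve aliases, subqueries, and dialect differences; the regex is used here only to keep the sketch short.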
