Next-Generation Database Interfaces: A Survey of LLM-based Text-to-SQL

📅 2024-06-12

🏛️ arXiv.org

📈 Citations: 25

✨ Influential: 1

🤖 AI Summary

This paper addresses the limitations of traditional pre-trained language models (PLMs) on text-to-SQL tasks in the large language model (LLM) era, namely poor generalization, high generation error rates, and prohibitive adaptation costs. We systematically survey LLM-driven techniques for generating SQL from natural language. We propose the first structured, knowledge-graph-inspired survey framework and formally characterize the paradigm shift from PLM fine-tuning to emerging approaches: prompt engineering, retrieval-augmented generation (RAG), database-schema-aware encoding, multi-step reasoning, and in-context learning. We comprehensively catalog mainstream benchmarks, evaluation metrics, and technical challenges, with particular emphasis on critical open issues such as scalability and robustness. Our work gives researchers a clear view of the field's evolution and gives practitioners a reusable technology roadmap and concrete directions for future work.
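To make the prompt-engineering and in-context-learning approaches named above more concrete, here is a minimal sketch of how a schema-aware few-shot prompt might be assembled. The prompt layout, the `Example` type, the `build_prompt` helper, and the `singer` table are all hypothetical illustrations, not a format prescribed by the survey.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """One in-context demonstration: a question paired with its SQL."""
    question: str
    sql: str


def build_prompt(schema_ddl: str, examples: list[Example], question: str) -> str:
    """Assemble a schema-aware few-shot prompt (hypothetical layout)."""
    # Start with the database schema so the model can ground column names.
    parts = ["-- Database schema", schema_ddl.strip(), ""]
    # Add each demonstration as a commented question followed by its SQL.
    for ex in examples:
        parts += [f"-- Question: {ex.question}", ex.sql.strip(), ""]
    # The target question ends with an open SELECT for the model to complete.
    parts += [f"-- Question: {question}", "SELECT"]
    return "\n".join(parts)


# Assumed toy schema and demonstration, for illustration only.
SCHEMA = """
CREATE TABLE singer (singer_id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
"""
DEMOS = [Example("How many singers are there?", "SELECT COUNT(*) FROM singer;")]

prompt = build_prompt(SCHEMA, DEMOS, "List the names of singers older than 30.")
print(prompt)
```

The resulting string would be sent to whichever LLM is in use; the model call itself is deliberately left out because it depends on the provider.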

📝 Abstract

Generating accurate SQL from users' natural language questions (text-to-SQL) remains a long-standing challenge due to the complexities involved in user question understanding, database schema comprehension, and SQL generation. Traditional text-to-SQL systems, which combine human engineering and deep neural networks, have made significant progress. Subsequently, pre-trained language models (PLMs) have been developed for text-to-SQL tasks, achieving promising results. However, as modern databases and user questions grow more complex, PLMs with a limited parameter size often produce incorrect SQL. This necessitates more sophisticated and tailored optimization methods, which restricts the application of PLM-based systems. Recently, large language models (LLMs) have shown significant capabilities in natural language understanding as model scale increases. Thus, integrating LLM-based solutions can bring unique opportunities, improvements, and solutions to text-to-SQL research. In this survey, we provide a comprehensive review of existing LLM-based text-to-SQL studies. Specifically, we offer a brief overview of the technical challenges and evolutionary process of text-to-SQL. Next, we introduce the datasets and metrics designed to evaluate text-to-SQL systems. Subsequently, we present a systematic analysis of recent advances in LLM-based text-to-SQL. Finally, we summarize the survey, discuss the remaining challenges in this field, and suggest directions for future research.
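The abstract refers to metrics for evaluating text-to-SQL systems without naming them here. One widely used family of metrics compares the result of executing a predicted query against the result of a reference query. The sketch below illustrates that idea with SQLite; the `execution_match` helper, the order-insensitive comparison, and the toy `singer` table are assumptions made for illustration, not the definition used by any specific benchmark.

```python
import sqlite3
from collections import Counter


def execution_match(conn: sqlite3.Connection, predicted_sql: str, gold_sql: str) -> bool:
    """Return True when both queries yield the same multiset of rows."""
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        # A prediction that fails to execute counts as a miss.
        return False
    gold_rows = conn.execute(gold_sql).fetchall()
    # Assumption: row order is ignored, duplicate rows are still counted.
    return Counter(predicted_rows) == Counter(gold_rows)


# Toy in-memory database, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE singer (singer_id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
    INSERT INTO singer (name, age) VALUES ('Ana', 25), ('Bo', 41), ('Cy', 33);
    """
)

gold = "SELECT name FROM singer WHERE age > 30"
same_result = execution_match(conn, "SELECT name FROM singer WHERE age >= 31", gold)
wrong_filter = execution_match(conn, "SELECT name FROM singer WHERE age > 40", gold)
invalid_sql = execution_match(conn, "SELECT nickname FROM singer", gold)
print(same_result, wrong_filter, invalid_sql)
```

Comparing executed results rather than query strings lets two differently written but equivalent queries count as a match, at the cost of occasionally accepting a wrong query that happens to return the same rows on the test database.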

Problem

Research questions and friction points this paper is trying to address.

Improving SQL generation from natural language queries.

Addressing limitations of pre-trained language models in text-to-SQL.

Exploring large language models for advanced text-to-SQL solutions.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Utilizes large language models.

Enhances text-to-SQL accuracy.

Overcomes traditional model limitations.

🔎 Similar Papers

A Survey on Employing Large Language Models for Text-to-SQL Tasks