AI Summary
This paper addresses the limitations of traditional pre-trained language models (PLMs) in text-to-SQL tasks in the large language model (LLM) era, namely poor generalization, high generation error rates, and prohibitive adaptation costs. We systematically survey LLM-driven natural language-to-SQL generation techniques. We propose the first structured, knowledge-graph-inspired survey framework and formally characterize the paradigm shift from PLM fine-tuning to emerging approaches: prompt engineering, retrieval-augmented generation (RAG), database-schema-aware encoding, multi-step reasoning, and in-context learning. We comprehensively catalog mainstream benchmarks, evaluation metrics, and technical challenges, with particular emphasis on critical open issues including scalability and robustness. Our work provides researchers with a clear evolutionary trajectory and practitioners with a reusable technology roadmap and concrete directions for future advancement.
Abstract
Generating accurate SQL from users' natural language questions (text-to-SQL) remains a long-standing challenge due to the complexities involved in user question understanding, database schema comprehension, and SQL generation. Traditional text-to-SQL systems, which combine human engineering and deep neural networks, have made significant progress. Subsequently, pre-trained language models (PLMs) have been developed for text-to-SQL tasks, achieving promising results. However, as modern databases and user questions grow more complex, PLMs with limited parameter sizes often produce incorrect SQL. This necessitates more sophisticated and tailored optimization methods, which restricts the application of PLM-based systems. Recently, large language models (LLMs) have shown significant capabilities in natural language understanding as model scale increases. Thus, integrating LLM-based approaches can bring unique opportunities, improvements, and solutions to text-to-SQL research. In this survey, we provide a comprehensive review of existing LLM-based text-to-SQL studies. Specifically, we offer a brief overview of the technical challenges and evolutionary process of text-to-SQL. Next, we introduce the datasets and metrics designed to evaluate text-to-SQL systems. Subsequently, we present a systematic analysis of recent advances in LLM-based text-to-SQL. Finally, we summarize the survey, discuss the remaining challenges in this field, and suggest directions for future research.