A Survey of NL2SQL with Large Language Models: Where are we, and where are we going?

📅 2024-08-09

🏛️ arXiv.org

📈 Citations: 15

✨ Influential: 0

🤖 AI Summary

This paper surveys the natural-language-to-SQL (NL2SQL) task as powered by large language models (LLMs), covering its full lifecycle. It organizes the field along four dimensions: model design (schema- and instance-aware modeling), data construction (including LLM-driven synthetic data generation), multi-granularity evaluation (spanning syntactic, executional, and semantic correctness), and error analysis (fine-grained classification aimed at root causes). The key contributions are threefold: (1) it offers an integrated full-lifecycle perspective on NL2SQL in the LLM era; (2) it formulates a development guideline balancing practicality and interpretability; and (3) it identifies core challenges (insufficient schema-aware reasoning, weak few-shot generalization, and poor robustness in real-world deployments) and maps them into a problem taxonomy and technical roadmap for future research.

📝 Abstract

Translating users' natural language queries (NL) into SQL queries (i.e., NL2SQL, a.k.a. Text-to-SQL) can significantly reduce barriers to accessing relational databases and support various commercial applications. The performance of NL2SQL has been greatly enhanced with the emergence of Large Language Models (LLMs). In this survey, we provide a comprehensive review of NL2SQL techniques powered by LLMs, covering the entire lifecycle from the following four aspects: (1) Model: NL2SQL translation techniques that not only tackle NL ambiguity and under-specification, but also properly map NL to database schema and instances; (2) Data: from the collection of training data, through data synthesis to address training data scarcity, to NL2SQL benchmarks; (3) Evaluation: evaluating NL2SQL methods from multiple angles using different metrics and granularities; and (4) Error Analysis: analyzing NL2SQL errors to find the root cause and guiding NL2SQL models to evolve. Moreover, we provide a rule of thumb for developing NL2SQL solutions. Finally, we discuss the research challenges and open problems of NL2SQL in the LLM era.
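
To make the evaluation aspect concrete, here is a minimal sketch of one commonly used style of check: running a predicted query and a reference query against the same database and comparing the returned rows, alongside a stricter normalized string comparison. This is an illustration only and is not taken from the paper; the `singer` table, the example queries, and the function names `execution_match` and `exact_string_match` are all hypothetical.

```python
import sqlite3
from collections import Counter


def execution_match(conn, predicted_sql, gold_sql):
    """Return True if both queries return the same multiset of rows.

    Row order is ignored, so this sketch is not suitable for queries
    where ORDER BY is part of the intended answer.
    """
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        # A predicted query that fails to run counts as a mismatch.
        return False
    gold_rows = conn.execute(gold_sql).fetchall()
    return Counter(predicted_rows) == Counter(gold_rows)


def exact_string_match(predicted_sql, gold_sql):
    """Compare queries after lowercasing and collapsing whitespace."""
    def normalize(sql):
        return " ".join(sql.lower().rstrip(";").split())
    return normalize(predicted_sql) == normalize(gold_sql)


def build_demo_db():
    """Create a small in-memory database (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)"
    )
    conn.executemany(
        "INSERT INTO singer (name, age) VALUES (?, ?)",
        [("Ana", 31), ("Bo", 45), ("Cy", 27)],
    )
    conn.commit()
    return conn


conn = build_demo_db()
gold = "SELECT name FROM singer WHERE age > 30"
pred = "SELECT s.name FROM singer AS s WHERE s.age >= 31"

print(execution_match(conn, pred, gold))   # same rows on this database
print(exact_string_match(pred, gold))      # different query text
```

The two checks can disagree: queries written differently may return identical rows on a given database, and identical rows on one database do not guarantee the queries are equivalent in general. That gap is one reason to evaluate from several angles rather than rely on a single metric.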

Problem

Research questions and friction points this paper is trying to address.

Enhancing NL2SQL translation using Large Language Models.

Addressing NL ambiguity and database schema mapping challenges.

Evaluating and improving NL2SQL methods through comprehensive analysis.

Innovation

Methods, ideas, or system contributions that make the work stand out.

LLMs enhance NL2SQL translation accuracy.

Data synthesis addresses training data scarcity.

Error analysis guides NL2SQL model evolution.

🔎 Similar Papers

A Survey on Employing Large Language Models for Text-to-SQL Tasks