🤖 AI Summary
To address the problem of excessive database schema information in NL2SQL prompts, which leads to overly long inputs, high token consumption, and high inference costs, this paper proposes the Datalake Agent, an agent-based interactive NL2SQL framework. It dynamically retrieves only essential metadata through iterative query understanding and on-demand active retrieval, substantially shortening input prompts. Unlike static full-schema injection, the Datalake Agent autonomously determines *when*, *where*, and *what* metadata to retrieve during inference, enabling precise, context-aware schema provisioning. Experiments across 23 databases and 100 table question answering tasks demonstrate up to an 87% reduction in token usage, and thus substantial cost savings, while maintaining competitive performance. The work introduces a lightweight, interactive agent paradigm into NL2SQL, offering a pathway toward efficient and scalable semantic parsing.
📝 Abstract
Translating natural language queries into SQL queries (NL2SQL or Text-to-SQL) has recently been empowered by large language models (LLMs). Applying LLM-based NL2SQL methods to a large collection of SQL databases necessitates processing large quantities of meta-information about the databases, which in turn results in lengthy prompts with many tokens and high processing costs. To address this challenge, we introduce the Datalake Agent, an agentic system designed to enable an LLM to solve NL2SQL tasks more efficiently. Instead of using direct solvers for NL2SQL that call the LLM once with all meta-information in the prompt, the Datalake Agent employs an interactive loop to reduce the meta-information consumed. Within the loop, the LLM operates in a reasoning framework that selectively requests only the information necessary to solve a table question answering task. We evaluate the Datalake Agent on a collection of 23 databases with 100 table question answering tasks. The Datalake Agent reduces the tokens used by the LLM by up to 87% and thus allows for substantial cost reductions while maintaining competitive performance.
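The interactive loop described above can be sketched in a few lines. The following is a minimal, hypothetical illustration, not the paper's actual implementation: the catalog contents, the `list_tables`/`get_schema` helpers, the `REQUEST`/`SQL:` reply protocol, and the deterministic `stub_llm` stand-in for a real LLM are all assumptions made for the example. The key idea it demonstrates is that the prompt starts with table names only, and schemas are appended on demand, so metadata for irrelevant tables never enters the prompt.

```python
# Hypothetical database catalog: table name -> schema string.
CATALOG = {
    "orders": "orders(id INT, customer_id INT, total REAL)",
    "customers": "customers(id INT, name TEXT)",
    "logs": "logs(ts TEXT, msg TEXT)",  # irrelevant table: its schema is never fetched
}

def list_tables():
    return list(CATALOG)

def get_schema(table):
    return CATALOG[table]

def stub_llm(prompt):
    """Stands in for a real LLM: requests schemas it lacks, then emits SQL."""
    if "orders(" not in prompt:
        return "REQUEST orders"
    if "customers(" not in prompt:
        return "REQUEST customers"
    return ("SQL: SELECT c.name FROM customers c "
            "JOIN orders o ON o.customer_id = c.id")

def datalake_agent(question, llm=stub_llm, max_steps=5):
    # Initial prompt contains only table names, not full schemas.
    prompt = f"Question: {question}\nTables: {', '.join(list_tables())}\n"
    for _ in range(max_steps):
        reply = llm(prompt)
        if reply.startswith("REQUEST "):
            # The model asked for metadata: fetch that table's schema on demand.
            table = reply.split(" ", 1)[1]
            prompt += get_schema(table) + "\n"
        elif reply.startswith("SQL: "):
            return reply[len("SQL: "):], prompt
    raise RuntimeError("no answer within step budget")

sql, final_prompt = datalake_agent("Which customers placed orders?")
```

In this toy run, the agent makes three LLM calls, and the final prompt contains the schemas of `orders` and `customers` but never the one for `logs`, which is the token saving a direct full-schema solver cannot achieve.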