🤖 AI Summary
This work addresses the limitations of current large language model-based Text-to-SQL systems in complex database scenarios, where they often suffer from inefficient reasoning, unstable outputs, and poor scalability. To overcome these challenges, the authors propose an agent framework augmented with a semantic memory mechanism that transforms historical interaction logs into structured programs and systematically reuses successful reasoning paths to guide subsequent SQL generation. This approach replaces conventional scratchpad or vector-retrieval methods, enabling efficient and controllable multi-step reasoning. The method significantly improves both inference efficiency and output consistency, reducing token consumption by 25% and reasoning trajectory length by 35% on average on Spider 2.0. Furthermore, it achieves a new state-of-the-art execution accuracy of 44.8% on Spider 2.0 Lite.
📝 Abstract
Recent advances in LLM-based Text-to-SQL have achieved remarkable gains on public benchmarks such as BIRD and Spider. Yet, these systems struggle to scale in realistic enterprise settings with large, complex schemas, diverse SQL dialects, and expensive multi-step reasoning. Emerging agentic approaches show potential for adaptive reasoning but often suffer from inefficiency and instability: repeating interactions with databases, producing inconsistent outputs, and occasionally failing to generate valid answers. To address these challenges, we introduce Agent Semantic Memory (AgentSM), an agentic framework for Text-to-SQL that builds and leverages interpretable semantic memory. Instead of relying on raw scratchpads or vector retrieval, AgentSM captures prior execution traces, or synthesizes curated ones, as structured programs that directly guide future reasoning. This design enables systematic reuse of reasoning paths, which allows agents to scale to larger schemas, more complex questions, and longer trajectories efficiently and reliably. Compared to state-of-the-art systems, AgentSM achieves higher efficiency by reducing average token usage and trajectory length by 25% and 35%, respectively, on the Spider 2.0 benchmark. It also improves execution accuracy, reaching a state-of-the-art accuracy of 44.8% on the Spider 2.0 Lite benchmark.
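To make the core idea concrete, the sketch below illustrates one plausible reading of the semantic-memory mechanism: successful reasoning traces are stored as structured step programs (rather than raw scratchpad text or embedding vectors) and retrieved to guide later queries over similar schema regions. All class and method names here are hypothetical illustrations, not the paper's actual interfaces.

```python
from dataclasses import dataclass, field

@dataclass
class TraceProgram:
    """A successful reasoning path, kept as an ordered list of step names."""
    schema_tags: frozenset  # tables/columns the trace touched (illustrative)
    steps: list             # e.g. ["inspect_schema", "draft_sql", "execute"]

@dataclass
class SemanticMemory:
    """Stores trace programs and reuses the closest one for a new question."""
    programs: list = field(default_factory=list)

    def record(self, schema_tags, steps):
        # Store a trace that led to a correct SQL answer.
        self.programs.append(TraceProgram(frozenset(schema_tags), list(steps)))

    def guide(self, schema_tags):
        # Return the stored program with the largest schema-tag overlap, if any.
        best, best_overlap = None, 0
        for prog in self.programs:
            overlap = len(prog.schema_tags & frozenset(schema_tags))
            if overlap > best_overlap:
                best, best_overlap = prog, overlap
        return best.steps if best else None

memory = SemanticMemory()
memory.record({"orders", "customers"}, ["inspect_schema", "draft_sql", "execute"])
plan = memory.guide({"orders", "order_items"})
print(plan)  # reuses the closest stored reasoning path as a plan
```

A retrieved program acts as a reusable plan, so the agent can skip exploratory database round-trips it has already performed for similar questions, which is one way the reported token and trajectory savings could arise.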