BAPPA: Benchmarking Agents, Plans, and Pipelines for Automated Text-to-SQL Generation

📅 2025-11-06
🤖 AI Summary
Current large language models (LLMs) face significant limitations in text-to-SQL generation, particularly due to database schema complexity and insufficient reasoning capabilities—challenges especially pronounced in smaller open-source models. To address this, we propose and systematically evaluate three multi-agent LLM pipelines: Multi-Round Discussion, Planner-Coder, and Coder-Aggregator. This work presents the first standardized benchmarking study of multi-agent collaborative architectures for text-to-SQL. Experimental results demonstrate that small models (e.g., Qwen2.5-7B-Instruct) achieve a 10.6% absolute gain in execution accuracy through three-round agent collaboration; furthermore, reasoning-capable planners substantially augment weak coders. Our best configuration attains a state-of-the-art 56.4% execution accuracy on Bird-Bench Mini-Dev, validating the effectiveness and practicality of lightweight multi-agent paradigms for enhancing SQL generation performance. The implementation is publicly available.

📝 Abstract
Text-to-SQL systems provide a natural language interface that can enable even laypeople to access information stored in databases. However, existing Large Language Models (LLMs) struggle with SQL generation from natural instructions due to large schema sizes and complex reasoning. Prior work often focuses on complex, somewhat impractical pipelines using flagship models, while smaller, efficient models remain overlooked. In this work, we explore three multi-agent LLM pipelines, with systematic performance benchmarking across a range of small to large open-source models: (1) Multi-agent discussion pipeline, where agents iteratively critique and refine SQL queries, and a judge synthesizes the final answer; (2) Planner-Coder pipeline, where a thinking model planner generates stepwise SQL generation plans and a coder synthesizes queries; and (3) Coder-Aggregator pipeline, where multiple coders independently generate SQL queries, and a reasoning agent selects the best query. Experiments on the Bird-Bench Mini-Dev set reveal that multi-agent discussion can improve small model performance, with up to a 10.6% increase in Execution Accuracy for Qwen2.5-7B-Instruct seen after three rounds of discussion. Among the pipelines, the LLM Reasoner-Coder pipeline yields the best results, with DeepSeek-R1-32B and QwQ-32B planners boosting Gemma 3 27B IT accuracy from 52.4% to the highest score of 56.4%. Code is available at https://github.com/treeDweller98/bappa-sql.
Problem

Research questions and friction points this paper is trying to address.

Addresses the challenge of generating SQL from natural language over large database schemas
Compares multi-agent pipelines for improving small-model Text-to-SQL performance
Evaluates planner-coder architectures for complex SQL query reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent discussion pipeline iteratively refines SQL queries
Planner-Coder pipeline generates stepwise SQL generation plans
Coder-Aggregator pipeline selects best query from multiple coders
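The three pipelines above can be sketched as simple orchestration functions. This is a minimal illustration, not the paper's implementation: `LLM` here is a hypothetical stand-in for any chat-completion call (prompt in, text out), and the prompt wording is invented for clarity.

```python
# Hypothetical sketch of the three BAPPA pipelines. `LLM` is any
# prompt -> response callable; prompts are illustrative, not the
# paper's actual templates.
from typing import Callable

LLM = Callable[[str], str]  # prompt -> model response


def multi_round_discussion(agents: list[LLM], judge: LLM,
                           question: str, rounds: int = 3) -> str:
    """Agents iteratively critique and refine SQL drafts;
    a judge synthesizes the final answer."""
    drafts = [agent(f"Write SQL for: {question}") for agent in agents]
    for _ in range(rounds - 1):
        drafts = [
            agent(f"Question: {question}\nPeer drafts: {drafts}\n"
                  "Critique the drafts and revise your SQL.")
            for agent in agents
        ]
    return judge(f"Question: {question}\nDrafts: {drafts}\n"
                 "Return the best final SQL.")


def planner_coder(planner: LLM, coder: LLM, question: str) -> str:
    """A reasoning-capable planner produces a stepwise plan;
    a coder turns the plan into SQL."""
    plan = planner(f"Outline steps to answer with SQL: {question}")
    return coder(f"Question: {question}\nPlan: {plan}\nWrite the SQL query.")


def coder_aggregator(coders: list[LLM], aggregator: LLM, question: str) -> str:
    """Independent coders each propose SQL;
    a reasoning agent selects the best candidate."""
    candidates = [c(f"Write SQL for: {question}") for c in coders]
    return aggregator(f"Question: {question}\nCandidates: {candidates}\n"
                      "Select and return the best SQL.")
```

In all three variants the only primitive is a text-completion call, which is why the pipelines transfer directly across the small and large open-source models benchmarked in the paper.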
Fahim Ahmed
Ph.D. Candidate, University of South Carolina
Freight systems · Optimization · Traffic Safety · Pavement management
Md Mubtasim Ahasan
Center for Computational & Data Sciences, Independent University, Bangladesh
Jahir Sadik Monon
Center for Computational & Data Sciences, Independent University, Bangladesh
Muntasir Wahed
University of Illinois Urbana-Champaign
Multimodal Learning · Vision Language Models · Conversational AI · Large Language Models
M Ashraful Amin
Center for Computational & Data Sciences, Independent University, Bangladesh
A K M Mahbubur Rahman
Center for Computational & Data Sciences, Independent University, Bangladesh
Amin Ahsan Ali
Independent University, Bangladesh
Machine Learning · Data Science · mHealth