Rethinking Agentic Workflows: Evaluating Inference-Based Test-Time Scaling Strategies in Text2SQL Tasks

📅 2025-10-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work systematically evaluates the practical efficacy of test-time scaling (TTS) strategies for Text-to-SQL generation. Addressing the multi-objective trade-off among accuracy, latency, and resource consumption in industrial deployment, we establish a comprehensive, real-world-oriented evaluation framework and benchmark six lightweight TTS strategies across four large language models—including two reasoning-augmented variants—on the BIRD Mini-Dev dataset. Results show that *divide-and-conquer* prompting and few-shot exemplars deliver robust, generalizable accuracy gains; model selection exerts significantly greater impact on performance than TTS strategy choice; and complex multi-step workflows yield no consistent improvement—some reasoning-enhanced models even underperform generic baselines. Crucially, this study provides the first empirical characterization of diminishing returns in TTS for Text-to-SQL, establishing marginal benefit boundaries. The findings offer evidence-based design principles and practical guidelines for building efficient, cost-effective SQL generation systems.
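The few-shot exemplar strategy highlighted above can be sketched as a prompt-construction step: worked question-to-SQL pairs are prepended to the target question before the model is called. This is a minimal illustration, not the paper's exact prompt template; the exemplars and helper names below are hypothetical.

```python
# Hedged sketch of few-shot prompting for Text2SQL.
# The exemplars and function name are illustrative, not from the paper.

FEW_SHOT_EXEMPLARS = [
    ("List all customer names.",
     "SELECT name FROM customers;"),
    ("How many orders were placed in 2023?",
     "SELECT COUNT(*) FROM orders WHERE strftime('%Y', order_date) = '2023';"),
]

def build_few_shot_prompt(schema: str, question: str,
                          exemplars=FEW_SHOT_EXEMPLARS) -> str:
    """Prepend worked question -> SQL pairs before the target question,
    so the model completes the final 'SQL:' line in the same pattern."""
    parts = [f"Schema:\n{schema}\n"]
    for q, sql in exemplars:
        parts.append(f"Question: {q}\nSQL: {sql}\n")
    parts.append(f"Question: {question}\nSQL:")
    return "\n".join(parts)
```

In deployments like those the paper evaluates, the exemplar count is a tunable knob: more shots can raise accuracy but also inflate token consumption, which is exactly the trade-off the benchmark measures.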

📝 Abstract
Large language models (LLMs) are increasingly powering Text-to-SQL (Text2SQL) systems, enabling non-expert users to query industrial databases using natural language. While test-time scaling strategies have shown promise in LLM-based solutions, their effectiveness in real-world applications, especially with the latest reasoning models, remains uncertain. In this work, we benchmark six lightweight, industry-oriented test-time scaling strategies and four LLMs, including two reasoning models, evaluating their performance on the BIRD Mini-Dev benchmark. Beyond standard accuracy metrics, we also report inference latency and token consumption, providing insights relevant for practical system deployment. Our findings reveal that Divide-and-Conquer prompting and few-shot demonstrations consistently enhance performance for both general-purpose and reasoning-focused LLMs. However, introducing additional workflow steps yields mixed results, and base model selection plays a critical role. This work sheds light on the practical trade-offs between accuracy, efficiency, and complexity when deploying Text2SQL systems.
Problem

Research questions and friction points this paper is trying to address.

Evaluating test-time scaling strategies for Text2SQL systems
Benchmarking lightweight TTS methods on the BIRD Mini-Dev benchmark across multiple LLMs
Analyzing accuracy-efficiency trade-offs in practical Text2SQL deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmarks lightweight test-time scaling strategies
Evaluates Divide-and-Conquer prompting effectiveness
Analyzes accuracy-efficiency trade-offs in deployment
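The Divide-and-Conquer prompting evaluated here can be sketched as a three-stage loop: decompose the question into sub-questions, draft a SQL fragment per sub-question, then ask the model to assemble a final query. This is a minimal sketch under assumed prompt wording; the paper's exact prompts are not reproduced, and `llm` stands in for any text-in/text-out model call.

```python
# Hedged sketch of Divide-and-Conquer prompting for Text2SQL.
# Prompt wording and helper names are illustrative assumptions.

def divide_and_conquer_sql(question: str, schema: str, llm) -> str:
    """Three-stage prompting: decompose, solve sub-parts, assemble.
    `llm` is any callable mapping a prompt string to a text response."""
    # Stage 1: ask the model to break the question into sub-questions.
    decompose_prompt = (
        f"Schema:\n{schema}\n\n"
        f"Question: {question}\n"
        "Break this question into simpler sub-questions, one per line."
    )
    sub_questions = [s for s in llm(decompose_prompt).splitlines() if s.strip()]

    # Stage 2: draft a SQL fragment for each sub-question.
    fragments = [
        llm(f"Schema:\n{schema}\nWrite SQL for: {sq}") for sq in sub_questions
    ]

    # Stage 3: merge the fragments into one final query.
    assemble_prompt = (
        f"Schema:\n{schema}\n\nQuestion: {question}\n"
        "Partial SQL pieces:\n" + "\n".join(fragments) +
        "\nCombine them into a single correct SQL query."
    )
    return llm(assemble_prompt)
```

Note the cost profile this implies: each question triggers 2 + len(sub_questions) model calls, which is why the summary's finding of diminishing returns and latency overhead matters for industrial use.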