🤖 AI Summary
This work systematically evaluates the practical efficacy of test-time scaling (TTS) strategies for Text-to-SQL generation. To address the trade-off among accuracy, latency, and resource consumption in industrial deployment, we establish an evaluation framework oriented toward real-world use and benchmark six lightweight TTS strategies across four large language models, two of them reasoning models, on the BIRD Mini-Dev dataset. Results show that *divide-and-conquer* prompting and few-shot exemplars deliver consistent accuracy gains for both general-purpose and reasoning models; that model selection affects performance more than the choice of TTS strategy; and that complex multi-step workflows yield no consistent improvement, with some reasoning models even underperforming general-purpose baselines. The study also provides the first empirical characterization of diminishing returns in TTS for Text-to-SQL and identifies the boundaries of its marginal benefit. The findings offer evidence-based design principles and practical guidelines for building efficient, cost-effective SQL generation systems.
📝 Abstract
Large language models (LLMs) are increasingly powering Text-to-SQL (Text2SQL) systems, enabling non-expert users to query industrial databases using natural language. While test-time scaling strategies have shown promise in LLM-based solutions, their effectiveness in real-world applications, especially with the latest reasoning models, remains uncertain. In this work, we evaluate six lightweight, industry-oriented test-time scaling strategies across four LLMs, including two reasoning models, on the BIRD Mini-Dev benchmark. Beyond standard accuracy metrics, we also report inference latency and token consumption, providing insights relevant for practical system deployment. Our findings reveal that Divide-and-Conquer prompting and few-shot demonstrations consistently enhance performance for both general-purpose and reasoning-focused LLMs. However, introducing additional workflow steps yields mixed results, and base model selection plays a critical role. This work sheds light on the practical trade-offs between accuracy, efficiency, and complexity when deploying Text2SQL systems.