Rethinking Agentic Workflows: Evaluating Inference-Based Test-Time Scaling Strategies in Text2SQL Tasks

📅 2025-10-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work systematically evaluates the practical efficacy of test-time scaling (TTS) strategies for Text-to-SQL generation. Addressing the multi-objective trade-off among accuracy, latency, and resource consumption in industrial deployment, we establish a comprehensive, real-world-oriented evaluation framework and benchmark six lightweight TTS strategies across four large language models—including two reasoning-augmented variants—on the BIRD Mini-Dev dataset. Results show that *divide-and-conquer* prompting and few-shot exemplars deliver robust, generalizable accuracy gains; model selection exerts significantly greater impact on performance than TTS strategy choice; and complex multi-step workflows yield no consistent improvement—some reasoning-enhanced models even underperform generic baselines. Crucially, this study provides the first empirical characterization of diminishing returns in TTS for Text-to-SQL, establishing marginal benefit boundaries. The findings offer evidence-based design principles and practical guidelines for building efficient, cost-effective SQL generation systems.
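The few-shot exemplar strategy highlighted above can be sketched as a prompt-construction step: worked question-to-SQL pairs are prepended to the target question before the model is called. This is a minimal illustration, not the paper's exact prompt template; the exemplars and helper names below are hypothetical.

```python
# Hedged sketch of few-shot prompting for Text2SQL.
# The exemplars and function name are illustrative, not from the paper.

FEW_SHOT_EXEMPLARS = [
    ("List all customer names.",
     "SELECT name FROM customers;"),
    ("How many orders were placed in 2023?",
     "SELECT COUNT(*) FROM orders WHERE strftime('%Y', order_date) = '2023';"),
]

def build_few_shot_prompt(schema: str, question: str,
                          exemplars=FEW_SHOT_EXEMPLARS) -> str:
    """Prepend worked question -> SQL pairs before the target question,
    so the model completes the final 'SQL:' line in the same pattern."""
    parts = [f"Schema:\n{schema}\n"]
    for q, sql in exemplars:
        parts.append(f"Question: {q}\nSQL: {sql}\n")
    parts.append(f"Question: {question}\nSQL:")
    return "\n".join(parts)
```

In deployments like those the paper evaluates, the exemplar count is a tunable knob: more shots can raise accuracy but also inflate token consumption, which is exactly the trade-off the benchmark measures.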

📝 Abstract
Large language models (LLMs) are increasingly powering Text-to-SQL (Text2SQL) systems, enabling non-expert users to query industrial databases using natural language. While test-time scaling strategies have shown promise in LLM-based solutions, their effectiveness in real-world applications, especially with the latest reasoning models, remains uncertain. In this work, we benchmark six lightweight, industry-oriented test-time scaling strategies and four LLMs, including two reasoning models, evaluating their performance on the BIRD Mini-Dev benchmark. Beyond standard accuracy metrics, we also report inference latency and token consumption, providing insights relevant for practical system deployment. Our findings reveal that Divide-and-Conquer prompting and few-shot demonstrations consistently enhance performance for both general-purpose and reasoning-focused LLMs. However, introducing additional workflow steps yields mixed results, and base model selection plays a critical role. This work sheds light on the practical trade-offs between accuracy, efficiency, and complexity when deploying Text2SQL systems.
Problem

Research questions and friction points this paper is trying to address.

Evaluating test-time scaling strategies for Text2SQL systems
Benchmarking lightweight TTS methods on the BIRD Mini-Dev benchmark across multiple LLMs
Analyzing accuracy-efficiency trade-offs in practical Text2SQL deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmarks lightweight test-time scaling strategies
Evaluates Divide-and-Conquer prompting effectiveness
Analyzes accuracy-efficiency trade-offs in deployment
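The Divide-and-Conquer prompting evaluated here can be sketched as a three-stage loop: decompose the question into sub-questions, draft a SQL fragment per sub-question, then ask the model to assemble a final query. This is a minimal sketch under assumed prompt wording; the paper's exact prompts are not reproduced, and `llm` stands in for any text-in/text-out model call.

```python
# Hedged sketch of Divide-and-Conquer prompting for Text2SQL.
# Prompt wording and helper names are illustrative assumptions.

def divide_and_conquer_sql(question: str, schema: str, llm) -> str:
    """Three-stage prompting: decompose, solve sub-parts, assemble.
    `llm` is any callable mapping a prompt string to a text response."""
    # Stage 1: ask the model to break the question into sub-questions.
    decompose_prompt = (
        f"Schema:\n{schema}\n\n"
        f"Question: {question}\n"
        "Break this question into simpler sub-questions, one per line."
    )
    sub_questions = [s for s in llm(decompose_prompt).splitlines() if s.strip()]

    # Stage 2: draft a SQL fragment for each sub-question.
    fragments = [
        llm(f"Schema:\n{schema}\nWrite SQL for: {sq}") for sq in sub_questions
    ]

    # Stage 3: merge the fragments into one final query.
    assemble_prompt = (
        f"Schema:\n{schema}\n\nQuestion: {question}\n"
        "Partial SQL pieces:\n" + "\n".join(fragments) +
        "\nCombine them into a single correct SQL query."
    )
    return llm(assemble_prompt)
```

Note the cost profile this implies: each question triggers 2 + len(sub_questions) model calls, which is why the summary's finding of diminishing returns and latency overhead matters for industrial use.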