Agentar-Scale-SQL: Advancing Text-to-SQL through Orchestrated Test-Time Scaling

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current Text-to-SQL methods still lag significantly behind human experts on complex benchmarks such as BIRD, and mainstream test-time scaling (TTS) strategies neither orchestrate their components systematically nor exploit the model's internal reasoning process. This paper proposes Agentar-Scale-SQL, a framework built on an Orchestrated Test-Time Scaling strategy that combines three complementary mechanisms: (1) internal scaling through reinforcement-learning-enhanced intrinsic reasoning; (2) sequential scaling through iterative SQL refinement; and (3) parallel scaling through diverse SQL synthesis coupled with tournament-based selection. The framework is general-purpose and designed for easy adaptation to new databases and more powerful language models. On the BIRD benchmark, Agentar-Scale-SQL reaches 81.67% execution accuracy on the test set, ranking first on the official leaderboard and narrowing the performance gap with human experts.

📝 Abstract
State-of-the-art (SOTA) Text-to-SQL methods still lag significantly behind human experts on challenging benchmarks like BIRD. Current approaches that explore test-time scaling lack an orchestrated strategy and neglect the model's internal reasoning process. To bridge this gap, we introduce Agentar-Scale-SQL, a novel framework leveraging scalable computation to improve performance. Agentar-Scale-SQL implements an Orchestrated Test-Time Scaling strategy that synergistically combines three distinct perspectives: i) Internal Scaling via RL-enhanced Intrinsic Reasoning, ii) Sequential Scaling through Iterative Refinement, and iii) Parallel Scaling using Diverse Synthesis and Tournament Selection. Agentar-Scale-SQL is a general-purpose framework designed for easy adaptation to new databases and more powerful language models. Extensive experiments show that Agentar-Scale-SQL achieves SOTA performance on the BIRD benchmark, reaching 81.67% execution accuracy on the test set and ranking first on the official leaderboard, demonstrating an effective path toward human-level performance.
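
Execution accuracy, the metric quoted in the abstract, is generally understood as the share of questions whose predicted SQL returns the same result as the reference SQL when both are run against the database. The sketch below illustrates that idea with SQLite. It assumes results are compared as unordered multisets of rows, which may differ in detail from the official BIRD evaluator; the function names and the toy `employees` table are hypothetical.

```python
import sqlite3
from collections import Counter


def run_query(conn, sql):
    """Execute sql and return its rows as a multiset, or None if execution fails."""
    try:
        return Counter(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None


def execution_accuracy(conn, pairs):
    """Fraction of (predicted_sql, gold_sql) pairs whose results match.

    Assumption: row order is ignored, duplicates are counted.
    """
    correct = 0
    for predicted_sql, gold_sql in pairs:
        gold = run_query(conn, gold_sql)
        predicted = run_query(conn, predicted_sql)
        if gold is not None and predicted == gold:
            correct += 1
    return correct / len(pairs) if pairs else 0.0


def demo_connection():
    """Build a tiny in-memory database used only for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
    conn.executemany(
        "INSERT INTO employees VALUES (?, ?)",
        [("Ann", 120), ("Bob", 90), ("Cy", 150)],
    )
    return conn
```

A prediction that fails to execute counts as wrong, and two differently written queries count as equivalent whenever they return the same rows.
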
Problem

Research questions and friction points this paper is trying to address.

Improving Text-to-SQL performance on challenging benchmarks like BIRD
Addressing lack of orchestrated strategy in test-time scaling approaches
Enhancing model reasoning through synergistic multi-perspective scaling methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Orchestrated test-time scaling strategy for Text-to-SQL
Combines internal, sequential, and parallel scaling methods
Leverages scalable computation to improve model performance
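
The way sequential and parallel scaling might fit together can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `generators`, `fix`, and `prefer` are hypothetical callables standing in for model calls, refinement here only retries on execution errors, and the tournament is a simple round-robin over pairwise preferences. Internal scaling (the RL-enhanced reasoning) lives inside the model and is not modelled.

```python
import sqlite3
from itertools import combinations


def execute(conn, sql):
    """Return (rows, None) on success or (None, error message) on failure."""
    try:
        return conn.execute(sql).fetchall(), None
    except sqlite3.Error as exc:
        return None, str(exc)


def refine(conn, sql, fix, max_rounds=2):
    """Sequential scaling: ask the hypothetical fixer to repair SQL that fails to run."""
    for _ in range(max_rounds):
        _, error = execute(conn, sql)
        if error is None:
            break
        sql = fix(sql, error)
    return sql


def tournament(candidates, prefer):
    """Round-robin pairwise comparison; the candidate with the most wins is selected."""
    wins = {sql: 0 for sql in candidates}
    for a, b in combinations(candidates, 2):
        wins[prefer(a, b)] += 1  # prefer must return either a or b
    return max(candidates, key=lambda sql: wins[sql])


def answer(conn, question, generators, fix, prefer):
    """Parallel scaling: draft with several generators, refine each, then select."""
    drafts = [generate(question) for generate in generators]
    refined = [refine(conn, sql, fix) for sql in drafts]
    # Drop duplicates (keeping order) and anything that still fails to execute.
    runnable = [sql for sql in dict.fromkeys(refined) if execute(conn, sql)[1] is None]
    return tournament(runnable, prefer) if runnable else None
```

In a real system each callable would wrap a language-model call; the point of the sketch is only the control flow, in which diverse drafts are repaired independently before a pairwise selection step picks one.
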
Pengfei Wang
Ant Digital Technologies, Ant Group
Baolin Sun
Ant Digital Technologies, Ant Group
Xuemei Dong
Ant Digital Technologies, Ant Group
Yaxun Dai
Soochow University
Hongwei Yuan
Zhejiang University
Mengdie Chu
Ant Digital Technologies, Ant Group
Yingqi Gao
Ant Digital Technologies, Ant Group
Xiang Qi
Ant Digital Technologies, Ant Group
Peng Zhang
Ant Digital Technologies, Ant Group
Ying Yan
Microsoft Research
Big Data Management