BIRD-INTERACT: Re-imagining Text-to-SQL Evaluation for Large Language Models via Lens of Dynamic Interactions

📅 2025-10-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multi-turn text-to-SQL benchmarks fail to capture key challenges faced by real-world database assistants: ambiguous user queries, recovery from execution errors, and evolving user intent. This paper introduces BIRD-INTERACT, a production-oriented multi-turn evaluation benchmark. It couples each database with a hierarchical knowledge base, metadata files, and a function-driven user simulator, and evaluates models in two modes: a predefined conversational protocol (c-Interact) and an open-ended agentic setting (a-Interact). Tasks cover the full CRUD spectrum and are checked with executable test cases; memory grafting and Interaction Test-time Scaling are used to analyze model behavior. Even GPT-5 completes only 8.67% of tasks in c-Interact and 17.00% in a-Interact, which indicates the benchmark's difficulty.

📝 Abstract
Large language models (LLMs) have demonstrated remarkable performance on single-turn text-to-SQL tasks, but real-world database applications predominantly require multi-turn interactions to handle ambiguous queries, execution errors, and evolving user requirements. Existing multi-turn benchmarks fall short by treating conversation histories as static context or limiting evaluation to read-only operations, failing to reflect production-grade database assistant challenges. We introduce BIRD-INTERACT, a benchmark that restores this realism through: (1) a comprehensive interaction environment coupling each database with a hierarchical knowledge base, metadata files, and a function-driven user simulator, enabling models to solicit clarifications, retrieve knowledge, and recover from errors without human supervision; (2) two evaluation settings consisting of a pre-defined conversational protocol (c-Interact) and an open-ended agentic setting (a-Interact) where models autonomously decide when to query the user simulator or explore the environment; (3) a challenging task suite covering the full CRUD spectrum for business-intelligence and operational use cases, guarded by executable test cases. Each task features ambiguous and follow-up sub-tasks requiring dynamic interaction. The suite comprises BIRD-INTERACT-FULL (600 tasks, up to 11,796 interactions) for comprehensive performance assessment, and BIRD-INTERACT-LITE (300 tasks with simplified databases) for detailed behavioral analysis and rapid method development. Our empirical results highlight BIRD-INTERACT's difficulty: GPT-5 completes only 8.67% of tasks in c-Interact and 17.00% in a-Interact. Analysis via memory grafting and Interaction Test-time Scaling validates the importance of effective interaction for complex, dynamic text-to-SQL tasks.

Problem

Research questions and friction points this paper is trying to address.

Evaluating multi-turn text-to-SQL interactions for real-world database applications
Addressing ambiguous queries and execution errors through dynamic user interactions
Creating realistic benchmarks covering full CRUD operations for business intelligence
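
To make the idea of dynamic interaction concrete, the following is a minimal sketch of an agentic episode in which a model decides whether to ask a simulated user for clarification or to submit SQL. Everything here is an assumption for illustration: the action names, `scripted_user`, `run_episode`, `toy_policy`, the turn budget, and the `orders` table are hypothetical and are not taken from the benchmark.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical action names; the benchmark's real action space is not shown here.
ASK, SUBMIT = "ask", "submit"

History = List[Tuple[str, str]]


def scripted_user(clarifications: Dict[str, str]) -> Callable[[str], str]:
    """Toy simulator: answers a question that mentions a known ambiguous term."""
    def answer(question: str) -> str:
        for term, meaning in clarifications.items():
            if term in question.lower():
                return meaning
        return "I cannot answer that."
    return answer


def run_episode(policy, user, check, budget: int = 5) -> Tuple[bool, History]:
    """Let the policy act until it submits SQL or the turn budget runs out."""
    history: History = []
    for _ in range(budget):
        action, payload = policy(history)
        if action == SUBMIT:
            history.append((SUBMIT, payload))
            return check(payload), history
        # Any other action is treated as a clarification question.
        history.append((payload, user(payload)))
    return False, history


def toy_policy(history: History) -> Tuple[str, str]:
    """Ask once about the ambiguous term, then submit SQL using the answer."""
    if not history:
        return ASK, "What does 'recent' mean?"
    window = "-30 day" if "30 days" in history[-1][1] else "-7 day"
    return SUBMIT, f"SELECT * FROM orders WHERE placed_at >= date('now', '{window}')"


user = scripted_user({"recent": "the last 30 days"})
ok, history = run_episode(toy_policy, user, check=lambda sql: "-30 day" in sql)
```

In this toy setup the string check stands in for a real correctness test; a realistic harness would execute the submitted SQL against a database instead.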

Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic interaction environment with hierarchical knowledge base
Two evaluation settings with conversational and agentic modes
Comprehensive CRUD task suite with executable test cases
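
As a rough illustration of why write operations call for executable test cases, the sketch below runs a predicted statement against a fresh database and then asserts on the resulting state rather than comparing SQL strings. It assumes SQLite via Python's standard `sqlite3` module; the `orders` schema, the task, and the helper names `make_db` and `passes_test` are hypothetical and not taken from the benchmark.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Build a tiny in-memory database (hypothetical schema for illustration)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, total REAL);
        INSERT INTO orders (status, total) VALUES
            ('pending', 40.0), ('pending', 120.0), ('shipped', 75.0);
        """
    )
    return conn


def passes_test(predicted_sql: str) -> bool:
    """Execute the predicted SQL, then check the database state it leaves behind.

    Hypothetical task: flag pending orders whose total exceeds 100.
    """
    conn = make_db()
    try:
        conn.execute(predicted_sql)
        rows = conn.execute("SELECT id, status FROM orders ORDER BY id").fetchall()
    except sqlite3.Error:
        # A statement that fails to execute counts as a failed test.
        return False
    finally:
        conn.close()
    return rows == [(1, "pending"), (2, "flagged"), (3, "shipped")]


good = passes_test(
    "UPDATE orders SET status = 'flagged' WHERE status = 'pending' AND total > 100"
)
too_broad = passes_test("UPDATE orders SET status = 'flagged' WHERE status = 'pending'")
broken = passes_test("UPDATE ordrs SET status = 'flagged'")
```

Checking state after execution means that two differently written but equivalent statements both pass, while a statement that over-updates or fails to run does not.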

Authors

Nan Huo
The University of Hong Kong

Xiaohan Xu
The University of Hong Kong
Knowledge Graph, Large Language Model, Text-to-SQL

Jinyang Li
The University of Hong Kong

Per Jacobsson
Google Cloud

Shipei Lin
The BIRD Team

Bowen Qin
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences

Binyuan Hui
Qwen Team, Alibaba Group
Large Language Models, CodeLLMs, Reasoning, Agent

Xiaolong Li
The University of Hong Kong

Ge Qu
The University of Hong Kong

Shuzheng Si
Tsinghua University
Natural Language Processing, Large Language Models

Linheng Han
The BIRD Team

Edward Alexander
The BIRD Team

Xintong Zhu
The BIRD Team

Rui Qin
Tsinghua University

Ruihan Yu
The BIRD Team

Yiyao Jin
The BIRD Team

Feige Zhou
The BIRD Team

Weihao Zhong
The BIRD Team

Yun Chen
The BIRD Team

Hongyu Liu
HKUST
Computer Vision

Chenhao Ma
The Chinese University of Hong Kong, Shenzhen
Data management, data mining

Fatma Ozcan
Google
Big data, query processing and optimization

Yannis Papakonstantinou
Google Cloud

Reynold Cheng
ACM Distinguished Member, HKU Computer Science Professor
Data Uncertainty, Graph Databases, Data Science for Social Goods