BIRD-INTERACT: Re-imagining Text-to-SQL Evaluation for Large Language Models via Lens of Dynamic Interactions

📅 2025-10-06

🤖 AI Summary

Existing multi-turn text-to-SQL benchmarks fail to capture key challenges faced by real-world database assistants: ambiguous user queries, recovery from execution errors, and evolving user intent. This paper introduces BIRD-INTERACT, a production-oriented multi-turn text-to-SQL benchmark. Each database is coupled with a hierarchical knowledge base, metadata files, and a function-driven user simulator, and models are evaluated in two modes: a predefined conversational protocol (c-Interact) and an open-ended agentic setting (a-Interact). Tasks cover the full CRUD spectrum and are checked by executable test cases. Even GPT-5 completes only 8.67% of tasks in c-Interact and 17.00% in a-Interact, which indicates the benchmark's difficulty. Analyses using memory grafting and Interaction Test-time Scaling support the importance of effective interaction for multi-turn SQL generation.

📝 Abstract

Large language models (LLMs) have demonstrated remarkable performance on single-turn text-to-SQL tasks, but real-world database applications predominantly require multi-turn interactions to handle ambiguous queries, execution errors, and evolving user requirements. Existing multi-turn benchmarks fall short by treating conversation histories as static context or limiting evaluation to read-only operations, failing to reflect production-grade database assistant challenges. We introduce BIRD-INTERACT, a benchmark that restores this realism through: (1) a comprehensive interaction environment coupling each database with a hierarchical knowledge base, metadata files, and a function-driven user simulator, enabling models to solicit clarifications, retrieve knowledge, and recover from errors without human supervision; (2) two evaluation settings consisting of a pre-defined conversational protocol (c-Interact) and an open-ended agentic setting (a-Interact) where models autonomously decide when to query the user simulator or explore the environment; (3) a challenging task suite covering the full CRUD spectrum for business-intelligence and operational use cases, guarded by executable test cases. Each task features ambiguous and follow-up sub-tasks requiring dynamic interaction. The suite comprises BIRD-INTERACT-FULL (600 tasks, up to 11,796 interactions) for comprehensive performance assessment, and BIRD-INTERACT-LITE (300 tasks with simplified databases) for detailed behavioral analysis and rapid method development. Our empirical results highlight BIRD-INTERACT's difficulty: GPT-5 completes only 8.67% of tasks in c-Interact and 17.00% in a-Interact. Analysis via memory grafting and Interaction Test-time Scaling validates the importance of effective interaction for complex, dynamic text-to-SQL tasks.
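The interaction pattern described in the abstract (ask the user simulator for clarification, submit SQL, recover from an execution error) can be illustrated with a toy loop. This is a minimal sketch and not the benchmark's actual harness: `run_episode`, `scripted_policy`, `simulator`, the `orders` table, and the turn budget are all hypothetical names and values chosen for illustration, and a scripted policy stands in for a real model.

```python
import sqlite3

def run_episode(conn, policy, simulator, max_turns=5):
    """Drive one toy task: the policy may ask the simulator or submit SQL."""
    history = []
    for _ in range(max_turns):
        kind, payload = policy(history)
        if kind == "ask":
            # Clarification turn: the simulated user answers the question.
            history.append(("ask", payload, simulator(payload)))
            continue
        try:
            rows = conn.execute(payload).fetchall()
        except sqlite3.Error as exc:
            # Execution error: feed the message back so the policy can retry.
            history.append(("error", payload, str(exc)))
            continue
        history.append(("sql", payload, rows))
        return rows, history
    return None, history

def scripted_policy(history):
    """Hypothetical stand-in for a model: clarify, fail once, then recover."""
    steps = [
        ("ask", "Does 'big orders' mean amount above 100?"),
        ("sql", "SELECT id FROM orders WHERE price > 100"),  # wrong column name
        ("sql", "SELECT id FROM orders WHERE amount > 100 ORDER BY id"),
    ]
    return steps[len(history)]

def simulator(question):
    """Hypothetical user simulator with a single canned answer."""
    return "Yes, amount above 100."

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 50.0), (2, 150.0), (3, 300.0)])

rows, history = run_episode(conn, scripted_policy, simulator)
print(rows)                      # result of the final, corrected query
print([h[0] for h in history])   # sequence of turn types
```

In this sketch the turn order is left entirely to the policy, which loosely resembles the open-ended agentic setting; a fixed protocol such as c-Interact would instead constrain when clarification and submission turns may occur.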

Problem

Research questions and friction points this paper is trying to address.

Evaluating multi-turn text-to-SQL interactions for real-world database applications

Addressing ambiguous queries and execution errors through dynamic user interactions

Creating realistic benchmarks covering full CRUD operations for business intelligence

Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic interaction environment with hierarchical knowledge base

Two evaluation settings with conversational and agentic modes

Comprehensive CRUD task suite with executable test cases

🔎 Similar Papers

Next-Generation Database Interfaces: A Survey of LLM-based Text-to-SQL