Rethinking Text-to-SQL: Dynamic Multi-turn SQL Interaction for Real-world Database Exploration

📅 2025-10-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing Text-to-SQL systems excel in static, single-turn settings but struggle with realistic multi-turn interactions where user intent evolves dynamically, as is common in financial and business analytics, and no dedicated benchmark exists for evaluating this capability. To address this, the paper proposes DySQL-Bench, the first benchmark for dynamic, interactive Text-to-SQL. Its schema-guided, LLM-based task-generation pipeline combines two-stage automated synthesis, interaction-oriented filtering, and expert validation; human evaluation confirms that the synthesized tasks are 100% correct. An LLM-simulated user enables closed-loop interactive evaluation. DySQL-Bench spans 13 domains and comprises 1,072 high-quality multi-turn tasks. Experimental results show that even GPT-4o achieves only 58.34% overall accuracy and 23.81% Pass@5, underscoring both the difficulty of dynamic interaction and the benchmark's utility for advancing robust, interactive semantic parsing.

📝 Abstract
Recent advances in Text-to-SQL have achieved strong results in static, single-turn tasks, where models generate SQL queries from natural language questions. However, these systems fall short in real-world interactive scenarios, where user intents evolve and queries must be refined over multiple turns. In applications such as finance and business analytics, users iteratively adjust query constraints or dimensions based on intermediate results. To evaluate such dynamic capabilities, we introduce DySQL-Bench, a benchmark assessing model performance under evolving user interactions. Unlike previous manually curated datasets, DySQL-Bench is built through an automated two-stage pipeline of task synthesis and verification. Structured tree representations derived from raw database tables guide LLM-based task generation, followed by interaction-oriented filtering and expert validation. Human evaluation confirms 100% correctness of the synthesized data. We further propose a multi-turn evaluation framework simulating realistic interactions among an LLM-simulated user, the model under test, and an executable database. The model must adapt its reasoning and SQL generation as user intents change. DySQL-Bench covers 13 domains across BIRD and Spider 2 databases, totaling 1,072 tasks. Even GPT-4o attains only 58.34% overall accuracy and 23.81% on the Pass@5 metric, underscoring the benchmark's difficulty. All code and data are released at https://github.com/Aurora-slz/Real-World-SQL-Bench
Problem

Research questions and friction points this paper is trying to address.

Addressing dynamic multi-turn SQL generation for evolving user intents
Evaluating model adaptation in real-world interactive database exploration
Benchmarking SQL systems under iterative query refinement scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic multi-turn SQL interaction for database exploration
Automated two-stage pipeline for benchmark synthesis
Multi-turn evaluation framework simulating realistic user interactions
Authors

Linzhuang Sun — University of Chinese Academy of Sciences (multimodal reasoning)
Tianyu Guo — Peking University, Beijing, China
Hao Liang — Peking University, Beijing, China
Yuying Li — Cheriton School of Computer Science, University of Waterloo (optimization, scientific computing, data mining, computational finance)
Qifeng Cai — Peking University, Beijing, China
Jingxuan Wei — University of Chinese Academy of Sciences, Beijing, China
Bihui Yu — University of Chinese Academy of Sciences, Beijing, China
Wentao Zhang — Institute of Physics, Chinese Academy of Sciences (photoemission, superconductivity, cuprates, HTSC, time-resolved)
Bin Cui — Peking University, Beijing, China