CLARITY: A Framework and Benchmark for Conversational Language Ambiguity and Unanswerability in Interactive NL2SQL Systems

📅 2026-04-24

🤖 AI Summary

Existing NL2SQL benchmarks rarely capture the multi-source ambiguities and unanswerable queries that arise in real-world interactions. This work proposes Clarity, a conversational NL2SQL evaluation framework that covers both single-turn and multi-turn scenarios. Using a constraint-driven pipeline, the framework automatically transforms executable SQL into natural language queries that embed several kinds of ambiguity at once (including schema-level ambiguity), and it synthesizes contextually coherent dialogue histories alongside schema metadata. This allows diverse ambiguity types, reflecting varied user behaviors, to be constructed systematically. Experiments on Spider and BIRD show that leading NL2SQL systems suffer significant performance degradation under such multi-faceted ambiguity: they often detect that a query is ambiguous, but they struggle to localize and resolve its schema-level sources.

📝 Abstract

NL2SQL systems deployed in industry settings often encounter ambiguous or unanswerable queries, particularly in interactive scenarios with incomplete user clarification. Existing benchmarks typically assume a single source of ambiguity and rely on user interaction for resolution, overlooking realistic failure modes. We introduce Clarity, a framework for automatically generating an NL2SQL benchmark with multi-faceted ambiguities and diverse user behaviors across both single- and multi-turn settings. Using a constraint-driven pipeline, Clarity transforms executable SQL into ambiguous queries, augmented with grounded conversational continuations and schema-level metadata. Empirical evaluation on Spider and BIRD shows that leading NL2SQL systems, including those based on strong LLMs, suffer significant performance degradation under multi-faceted ambiguity. While these systems often detect ambiguity, they struggle to accurately localize and resolve the underlying schema-level sources. Our results highlight the need for more robust ambiguity detection and resolution in industry-grade NL2SQL systems.
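
To make the distinction between ambiguous and unanswerable queries concrete, the following is a minimal sketch and not the Clarity pipeline itself. It assumes a toy schema and a naive substring match between a mentioned term and column names; the names `BenchmarkItem` and `classify_mention`, the record fields, and the example schema are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class BenchmarkItem:
    """Hypothetical record layout; field names are illustrative, not from the paper."""
    gold_sql: str                     # executable SQL the question was derived from
    question: str                     # natural language question shown to the system
    label: str                        # "clear", "ambiguous", or "unanswerable"
    candidates: List[str] = field(default_factory=list)  # schema elements the term may refer to
    history: List[str] = field(default_factory=list)     # earlier dialogue turns, if any


def classify_mention(schema: Dict[str, List[str]], mention: str) -> Tuple[str, List[str]]:
    """Match a mentioned term against column names (case-insensitive substring).

    Assumption: real systems would use far richer matching; this only shows
    how zero, one, or several schema matches map to the three labels.
    """
    term = mention.lower()
    hits = [
        f"{table}.{column}"
        for table, columns in schema.items()
        for column in columns
        if term in column.lower()
    ]
    if not hits:
        return "unanswerable", hits   # nothing in the schema grounds the term
    if len(hits) > 1:
        return "ambiguous", hits      # several schema elements compete
    return "clear", hits


# Toy schema, invented for this example.
schema = {
    "orders": ["order_id", "order_date", "ship_date"],
    "customers": ["customer_id", "name", "signup_date"],
}

label, candidates = classify_mention(schema, "date")
item = BenchmarkItem(
    gold_sql="SELECT COUNT(*) FROM orders WHERE order_date >= '2024-01-01'",
    question="How many orders have a date in 2024 or later?",
    label=label,
    candidates=candidates,
)
```

Here the term "date" matches three columns across two tables, so the question is labeled ambiguous and the competing columns are kept as candidates; a term with no match at all, such as "salary", would be labeled unanswerable. Localizing the schema-level source of an ambiguity, in the abstract's sense, corresponds to recovering a candidate list of this kind rather than only the label.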

Problem

Research questions and friction points this paper is trying to address.

NL2SQL

ambiguity

unanswerability

interactive systems

conversational AI

Innovation

Methods, ideas, or system contributions that make the work stand out.

conversational NL2SQL

ambiguity generation

unanswerable queries