🤖 AI Summary
Current Text-to-SQL evaluation relies on static test databases, which leads to false positives: syntactically distinct and semantically different SQL queries can coincidentally yield identical execution results, inflating accuracy estimates. To address this, we propose SpotIt, the first evaluation paradigm to integrate formal bounded equivalence verification into Text-to-SQL assessment. SpotIt actively synthesizes counterexample databases that distinguish the generated SQL query from the reference query, exposing semantic deviations masked by conventional result-set matching. We extend existing verifiers to support a richer subset of SQL (including joins, aggregations, and nested subqueries) by combining constraint solving with concrete instance generation for efficient equivalence checking. Evaluating ten state-of-the-art models on the BIRD benchmark, SpotIt reveals an average accuracy overestimation of 12.7%, demonstrating its effectiveness in assessing true semantic correctness.
📝 Abstract
Community-driven Text-to-SQL evaluation platforms play a pivotal role in tracking the state of the art in Text-to-SQL performance, and the reliability of the evaluation process is critical for driving progress in the field. Current evaluation methods are largely test-based: they compare the execution results of a generated SQL query and a human-labeled ground truth on a static test database. Such an evaluation is optimistic, as two queries can coincidentally produce the same output on the test database while being semantically different. In this work, we propose an alternative evaluation pipeline, called SpotIt, in which a formal bounded equivalence verification engine actively searches for a database that differentiates the generated and ground-truth SQL queries. We develop techniques to extend existing verifiers to support a richer SQL subset relevant to Text-to-SQL. A performance evaluation of ten Text-to-SQL methods on the high-profile BIRD dataset suggests that test-based methods often overlook differences between the generated query and the ground truth. Further analysis of the verification results reveals a more complex picture of current Text-to-SQL evaluation.
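The failure mode described above can be made concrete with a small sketch. The schema, queries, and databases here are hypothetical illustrations (not taken from the paper or the BIRD benchmark): two queries that differ only in a boundary condition agree on one static test database, so test-based evaluation would mark the prediction correct, while a second database plays the role of the counterexample a verifier like SpotIt would search for.

```python
import sqlite3

# Hypothetical ground-truth and generated queries: they differ only at age == 18.
GOLD = "SELECT name FROM users WHERE age >= 18"
PRED = "SELECT name FROM users WHERE age > 18"

def run(query, rows):
    """Execute `query` against a fresh in-memory database seeded with `rows`."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE users (name TEXT, age INTEGER)")
    con.executemany("INSERT INTO users VALUES (?, ?)", rows)
    result = sorted(con.execute(query).fetchall())
    con.close()
    return result

# Static test database: no row has age exactly 18, so the two queries
# coincidentally agree and result-set matching reports a (false) pass.
test_db = [("Ann", 25), ("Bob", 17)]
assert run(GOLD, test_db) == run(PRED, test_db)

# Counterexample database: a single row with age = 18 separates the queries,
# revealing the semantic difference that the static database masked.
counterexample_db = [("Eve", 18)]
assert run(GOLD, counterexample_db) != run(PRED, counterexample_db)
```

A bounded equivalence verifier automates the last step: instead of relying on a fixed database, it searches (up to a size bound) for any instance on which the two queries disagree, and declares the pair equivalent only if none exists.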