🤖 AI Summary
This work addresses the limitations of existing Text-to-SQL evaluation methods, which often fail to capture semantic discrepancies between generated and reference SQL queries, particularly when real database constraints are unavailable. To overcome this, the authors propose a bounded equivalence verification framework that actively searches for database instances capable of distinguishing the semantics of two queries. The core innovation lies in integrating rule-driven constraint mining with large language model-based validation, ensuring that the generated counterexamples are both semantically discriminative and realistic under practical deployment scenarios. Experiments on the BIRD dataset demonstrate that the proposed approach efficiently uncovers numerous semantic errors missed by conventional evaluation metrics, thereby substantially enhancing the validity and fidelity of Text-to-SQL system assessment.
📝 Abstract
We present SpotIt+, an open-source tool for evaluating Text-to-SQL systems via bounded equivalence verification. Given a generated SQL query and the ground truth, SpotIt+ actively searches for database instances that differentiate the two queries. To ensure that the generated counterexamples reflect practically relevant discrepancies, we introduce a constraint-mining pipeline that combines rule-based specification mining over example databases with LLM-based validation. Experimental results on the BIRD dataset show that the mined constraints enable SpotIt+ to generate more realistic differentiating databases, while preserving its ability to efficiently uncover numerous discrepancies between generated and gold SQL queries that are missed by standard test-based evaluation.
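The core idea of bounded equivalence verification described above, searching for a database instance on which two SQL queries disagree, can be illustrated with a minimal sketch. This is a brute-force enumeration over a hypothetical one-table schema `t(id, val)` with a small value domain, not SpotIt+'s actual procedure (which uses symbolic search and mined constraints); the function names `differs` and `find_counterexample` are invented here for illustration.

```python
import itertools
import sqlite3

def differs(sql_a, sql_b, rows):
    """Run both queries on a fresh in-memory database seeded with `rows`
    and report whether their (order-insensitive) results disagree."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE t (id INTEGER, val INTEGER)")
    con.executemany("INSERT INTO t VALUES (?, ?)", rows)
    res_a = sorted(con.execute(sql_a).fetchall())
    res_b = sorted(con.execute(sql_b).fetchall())
    con.close()
    return res_a != res_b

def find_counterexample(sql_a, sql_b, domain=range(3), max_rows=2):
    """Bounded search: enumerate every instance of table t with at most
    `max_rows` rows whose cells are drawn from `domain`, and return the
    first instance that differentiates the two queries (None if they
    agree on every instance within the bound)."""
    cells = list(itertools.product(domain, repeat=2))  # candidate (id, val) rows
    for n in range(max_rows + 1):
        for rows in itertools.combinations_with_replacement(cells, n):
            if differs(sql_a, sql_b, list(rows)):
                return list(rows)
    return None

# Two queries that agree on most small instances but diverge when val = 0:
q_generated = "SELECT id FROM t WHERE val >= 0"
q_gold      = "SELECT id FROM t WHERE val > 0"
print(find_counterexample(q_generated, q_gold))  # → [(0, 0)]
```

Test-based evaluation on a fixed example database can miss this discrepancy whenever the database happens to contain no boundary row; the active search above finds the smallest instance exposing it. A constraint-mining step, as in SpotIt+, would additionally restrict the enumerated instances to those satisfying realistic integrity constraints.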