AI Summary
This work addresses the challenge of efficiently evaluating Text-to-SQL (Text2SQL) models in dynamic database environments where schema or content changes occur and labeled SQL queries are unavailable. To overcome this limitation, the authors propose FusionSQL, a novel method that estimates model accuracy without requiring ground-truth labels. FusionSQL integrates large language model output analysis, distribution discrepancy detection, and self-supervised evaluation techniques to enable, for the first time, performance monitoring of arbitrary Text2SQL models on unlabeled data. Experimental results demonstrate that FusionSQL's estimates exhibit strong correlation with actual model accuracy across diverse query types and real-world scenarios. The approach effectively supports pre-deployment validation and continuous performance monitoring, thereby breaking the longstanding dependency on costly human annotations for evaluation.
Abstract
Recent advances in large language models have strengthened Text2SQL systems that translate natural language questions into database queries. A persistent deployment challenge is to assess a newly trained Text2SQL system on an unseen and unlabeled dataset when no verified answers are available. This situation arises frequently because database content and structure evolve, privacy policies slow manual review, and carefully written SQL labels are costly and time-consuming to produce. Without timely evaluation, organizations cannot approve releases or detect failures early. FusionSQL addresses this gap by working with any Text2SQL model and estimating its accuracy without reference labels, allowing teams to measure quality on unseen and unlabeled datasets. It analyzes patterns in the system's own outputs to characterize how the target dataset differs from the material used during training. FusionSQL supports pre-release checks, continuous monitoring of new databases, and detection of quality decline. Experiments across diverse application settings and question types show that FusionSQL's estimates closely follow actual accuracy and reliably signal emerging issues. Our code is available at https://github.com/phkhanhtrinh23/FusionSQL.