🤖 AI Summary
This study addresses the pervasive issue of human annotation errors in mainstream text-to-SQL benchmarks, such as BIRD and Spider 2.0-Snow, which severely compromise model evaluation and leaderboard reliability. For the first time, this work systematically quantifies these errors through expert validation, revealing alarming error rates of 52.8% in BIRD Mini-Dev and 62.8% in Spider 2.0-Snow. Re-evaluating 16 open-source agents on corrected subsets yields relative performance shifts ranging from −7% to +31%, with ranking changes of up to ±9 positions and a marked drop in Spearman correlation between original and corrected leaderboards. These findings expose significant evaluation bias in current benchmarks and establish a more trustworthy foundation for future research and deployment.
📝 Abstract
Researchers have proposed numerous text-to-SQL techniques to streamline data analytics and accelerate the development of data-driven applications. To compare these techniques and select the best one for deployment, the community depends on public benchmarks and their leaderboards. Since these benchmarks heavily rely on human annotations during question construction and answer evaluation, the validity of the annotations is crucial. In this paper, we conduct an empirical study that (i) benchmarks annotation error rates for two widely used text-to-SQL benchmarks, BIRD and Spider 2.0-Snow, and (ii) corrects a subset of the BIRD development (Dev) set to measure the impact of annotation errors on text-to-SQL agent performance and leaderboard rankings. Through expert analysis, we show that BIRD Mini-Dev and Spider 2.0-Snow have error rates of 52.8% and 62.8%, respectively. We re-evaluate all 16 open-source agents from the BIRD leaderboard on both the original and the corrected BIRD Dev subsets. We show that performance changes range from -7% to +31% (in relative terms) and rank changes range from $-9$ to $+9$ positions. We further assess whether these impacts generalize to the full BIRD Dev set. We find that the rankings of agents on the uncorrected subset correlate strongly with those on the full Dev set (Spearman's $r_s$=0.85, $p$=3.26e-5), whereas they correlate weakly with those on the corrected subset (Spearman's $r_s$=0.32, $p$=0.23). These findings show that annotation errors can significantly distort reported performance and rankings, potentially misguiding research directions or deployment choices. Our code and data are available at https://github.com/uiuc-kang-lab/text_to_sql_benchmarks.
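The metrics quoted in the abstract (relative performance shifts and Spearman correlation between leaderboard rankings) are simple to compute. The sketch below illustrates them on a small hypothetical leaderboard; the agent names and accuracy numbers are invented for illustration and are not from the paper's data:

```python
# Sketch: relative performance shift and Spearman rank correlation for a
# leaderboard re-evaluation. All agent names and scores are hypothetical.

def relative_shift(original: float, corrected: float) -> float:
    """Relative change in accuracy, in percent, when moving to the corrected set."""
    return (corrected - original) / original * 100.0

def spearman(rank_a: list[int], rank_b: list[int]) -> float:
    """Spearman's r_s for tie-free rankings: 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

def to_ranks(scores: dict[str, float]) -> list[int]:
    """Rank 1 = best score; ranks are returned in the dict's key order."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return [ordered.index(name) + 1 for name in scores]

# Hypothetical accuracies (%) of four agents on the original vs. corrected subset.
original = {"agent_a": 60.0, "agent_b": 55.0, "agent_c": 50.0, "agent_d": 45.0}
corrected = {"agent_a": 55.8, "agent_b": 62.0, "agent_c": 48.0, "agent_d": 52.0}

for name in original:
    print(name, f"{relative_shift(original[name], corrected[name]):+.1f}%")

# Correlation between the two leaderboards' rankings.
print(spearman(to_ranks(original), to_ranks(corrected)))  # 0.6 for this toy data
```

Identical rankings give $r_s = 1$ and fully reversed rankings give $r_s = -1$, so the paper's drop from 0.85 to 0.32 indicates that the corrected subset produces a substantially different ordering of agents than the original one.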