PolySQL: Scaling Text-to-SQL Evaluation Across SQL Dialects via Automated Backend Isomorphism

📅 2026-05-08
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
Current Text-to-SQL evaluation predominantly relies on SQLite, which fails to reflect models' true performance on other SQL dialects and biases evaluation results. To address this limitation, this work proposes a translation-free dual-execution evaluation framework that reaches 100% query coverage in cross-dialect assessment by executing the original and target-dialect queries and comparing their normalized execution results. This approach avoids the errors inherent in query translation and yields higher evaluation fidelity. Using three newly constructed cross-dialect datasets, the experiments show an average accuracy drop of 10.1% from SQLite to other dialects, with 61% of errors attributable to logical discrepancies and only 8% to syntactic differences.
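The result-comparison step described above can be illustrated with a minimal sketch. This is not PolySQL's actual implementation: the helper names (`normalize_value`, `normalize_rows`, `results_match`, `make_db`), the rounding precision, and the use of two in-memory SQLite databases as stand-ins for two different engines are all assumptions made for illustration.

```python
import sqlite3
from collections import Counter
from decimal import Decimal


def normalize_value(v, ndigits=6):
    """Map engine-specific value representations onto one canonical form (hypothetical rules)."""
    if v is None:
        return None
    if isinstance(v, (int, float, Decimal)):
        # Treat 1, 1.0 and Decimal("1.00") as the same number.
        return round(float(v), ndigits)
    if isinstance(v, bytes):
        return v.decode("utf-8", errors="replace")
    return str(v)


def normalize_rows(rows, ordered=False):
    """Canonicalize a result set; compare as a multiset unless row order matters."""
    canon = [tuple(normalize_value(v) for v in row) for row in rows]
    return canon if ordered else Counter(canon)


def results_match(rows_a, rows_b, ordered=False):
    """True when two result sets are equal after normalization."""
    return normalize_rows(rows_a, ordered) == normalize_rows(rows_b, ordered)


def make_db(price_type, rows):
    """Build a toy in-memory database; stands in for one backend (assumption)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(f"CREATE TABLE items (name TEXT, price {price_type})")
    conn.executemany("INSERT INTO items VALUES (?, ?)", rows)
    return conn


# Two backends holding the same data with different numeric types.
source_db = make_db("INTEGER", [("a", 1), ("b", 2)])
target_db = make_db("REAL", [("a", 1.0), ("b", 2.0)])

# Reference query runs on the source backend, candidate query on the target backend.
gold_rows = source_db.execute("SELECT name, price FROM items ORDER BY name DESC").fetchall()
pred_rows = target_db.execute("SELECT name, price FROM items").fetchall()

print(results_match(gold_rows, pred_rows))                # order-insensitive comparison
print(results_match(gold_rows, pred_rows, ordered=True))  # order-sensitive comparison
```

Comparing multisets rather than lists keeps duplicate rows significant while ignoring row order, which is a common choice when the question does not ask for sorted output.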
📝 Abstract
SQL dialects vary in syntax, types, and functions across database engines. Text-to-SQL benchmarks, however, predominantly support only SQLite. This creates a critical evaluation gap: cross-dialect evaluation reveals weak per-query agreement (measured by Cohen's κ), showing that SQLite performance is an unreliable proxy for other dialects. Yet such evaluation remains prohibitively difficult: existing approaches either require expensive manual query transpilation or rely on tools that often fail on complex SQL. To close this gap, we introduce PolySQL, a novel dual-execution method that eliminates the need for query transpilation by comparing normalized execution results. Notably, our approach achieves higher evaluation fidelity than query transpilation with 100% query coverage. PolySQL comprises three datasets, enabling the first large-scale cross-dialect study. Our study reveals a 10.1% average accuracy drop from SQLite to other dialects and identifies a significant dialect difficulty hierarchy. We find this degradation stems from logical rather than syntactic errors (61% vs. 8%). We release our framework code and leaderboard to enable rigorous dialect-robust evaluation.
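Per-query agreement of the kind mentioned in the abstract is typically quantified with Cohen's kappa, $\kappa = (p_o - p_e) / (1 - p_e)$, where $p_o$ is the observed agreement and $p_e$ the agreement expected by chance. The sketch below computes it for two per-query correctness vectors; the function name and the example labels are made up for illustration and are not results from the paper.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of queries with the same outcome.
    p_o = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the marginal label frequencies, summed over labels.
    categories = set(labels_a) | set(labels_b)
    p_e = sum((labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories)
    if p_e == 1.0:
        return float("nan")  # kappa is undefined when both sequences are one constant label
    return (p_o - p_e) / (1 - p_e)


# Hypothetical per-query outcomes (1 = correct, 0 = incorrect) for the same
# questions evaluated on SQLite and on another dialect.
sqlite_correct = [1, 1, 1, 0, 1, 0, 1, 1]
other_correct = [1, 0, 1, 0, 0, 1, 1, 0]

print(cohens_kappa(sqlite_correct, other_correct))
```

A value near 0 means the two evaluations agree no more often than chance would predict, even if their aggregate accuracies look similar.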
Problem

Research questions and friction points this paper is trying to address.

Text-to-SQL
SQL dialects
cross-dialect evaluation
evaluation gap
query transpilation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Text-to-SQL
SQL dialects
cross-dialect evaluation
query transpilation
execution-based evaluation