PARROT: A Benchmark for Evaluating LLMs in Cross-System SQL Translation

📅 2025-09-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing SQL benchmarks inadequately evaluate cross-system SQL-to-SQL translation, as they are confined to a narrow set of systems (e.g., SQLite) and overlook dialectal differences, including proprietary functions, data types, and syntactic rules. To address this, we propose PARROT, the first multi-system, multi-scenario benchmark explicitly designed for SQL-to-SQL translation, covering 22 production-grade database systems. PARROT comprises 598 high-quality translation pairs derived from 38 open-source benchmarks and real-world business SQL queries, with explicit modeling of system-specific syntax, built-in functions, and type semantics. It includes variants for broad syntactic coverage and focused stress testing, and features a publicly accessible leaderboard. Experimental evaluation reveals that state-of-the-art large language models achieve below 38.53% average accuracy, exposing fundamental weaknesses in cross-system semantic alignment. PARROT thus establishes a rigorous evaluation foundation and identifies critical research directions for SQL migration.

📝 Abstract
Large language models (LLMs) have shown increasing effectiveness in Text-to-SQL tasks. However, another closely related problem, Cross-System SQL Translation (a.k.a., SQL-to-SQL), which adapts a query written for one database system (e.g., MySQL) into an equivalent query for another system (e.g., ClickHouse), is of great practical importance but remains underexplored. Existing SQL benchmarks are not well-suited for SQL-to-SQL evaluation, as they (1) focus on a limited set of database systems (often just SQLite) and (2) cannot capture many system-specific SQL dialects (e.g., customized functions, data types, and syntax rules). Thus, in this paper, we introduce PARROT, a Practical And Realistic BenchmaRk for CrOss-System SQL Translation. PARROT comprises 598 translation pairs from 38 open-source benchmarks and real-world business services, specifically prepared to challenge system-specific SQL understanding (e.g., LLMs achieve below 38.53% accuracy on average). We also provide multiple benchmark variants, including PARROT-Diverse with 28,003 translations (for extensive syntax testing) and PARROT-Simple with 5,306 representative samples (for focused stress testing), covering 22 production-grade database systems. To promote future research, we release a public leaderboard and source code at: https://code4db.github.io/parrot-bench/.
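To make the dialect gap concrete, consider a minimal sketch (not part of PARROT, and far simpler than real cross-system translation) that rewrites a couple of MySQL function names into their ClickHouse equivalents. The mapping table here is a tiny illustrative assumption; a faithful translator must also handle data types, quoting, and syntax rules, which is exactly the difficulty the benchmark targets.

```python
import re

# Toy MySQL -> ClickHouse function-name mapping (illustrative only).
MYSQL_TO_CLICKHOUSE = {
    "IFNULL": "ifNull",  # MySQL IFNULL(x, y) -> ClickHouse ifNull(x, y)
    "NOW": "now",        # MySQL NOW() -> ClickHouse now()
}

def naive_translate(sql: str) -> str:
    """Rewrite known function names; everything else passes through unchanged."""
    def repl(match: re.Match) -> str:
        name = match.group(1)
        return MYSQL_TO_CLICKHOUSE.get(name.upper(), name) + "("
    # Match an identifier immediately followed by an opening parenthesis.
    return re.sub(r"\b(\w+)\s*\(", repl, sql)

print(naive_translate("SELECT IFNULL(price, 0), NOW() FROM orders"))
# -> SELECT ifNull(price, 0), now() FROM orders
```

Name-level rewriting like this quickly breaks down (e.g., functions whose argument semantics differ between systems), which is why PARROT evaluates end-to-end translations rather than surface substitutions.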
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs for translating SQL queries between different database systems
Addressing limitations of existing benchmarks for cross-system SQL translation
Providing comprehensive testing for system-specific SQL dialects and functions
Innovation

Methods, ideas, or system contributions that make the work stand out.

PARROT benchmark for cross-system SQL translation
Includes diverse SQL pairs from real-world sources
Covers 22 production-grade database systems