SynSQL: Synthesizing Relational Databases for Robust Evaluation of Text-to-SQL Systems

📅 2026-04-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Text-to-SQL evaluation typically relies on a single static database, which cannot capture how a model behaves across different data instances and may bias the reported results. This work proposes SynSQL, a framework that uses large language models to generate semantically consistent, schema-aligned relational test data directly from natural language questions. SynSQL formulates database construction as a structured generation task governed by semantic and relational constraints, and comprises three stages: schema selection, question-guided data synthesis, and constraint-aware iterative refinement. In experiments on Spider, BIRD, and Spider 2.0, databases generated by SynSQL lower the measured performance of ten text-to-SQL models by 3–14% relative to static evaluation, uncovering errors that a single fixed database masks and making the evaluation a more reliable stress test.
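To make the idea of relational constraints on synthesized data concrete, here is a minimal sketch of one check a constraint-aware refinement step could run: loading candidate rows into SQLite and listing foreign-key violations. This is an illustration only, not SynSQL's implementation; the schema, the sample rows, and the helper name `find_fk_violations` are all assumptions made for the example.

```python
import sqlite3

# Hypothetical two-table schema with one foreign key (not from the paper).
SCHEMA = """
CREATE TABLE department (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE employee (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    dept_id INTEGER REFERENCES department(id)
);
"""


def find_fk_violations(rows_by_table):
    """Load candidate rows and return SQLite's foreign-key violation report."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    for table, rows in rows_by_table.items():
        if not rows:
            continue
        placeholders = ", ".join("?" for _ in rows[0])
        # Table names come from our own dict, so the f-string is safe here.
        conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
    # Foreign-key enforcement is off by default in SQLite, so bad rows load;
    # PRAGMA foreign_key_check then reports each dangling reference as
    # (child table, rowid, parent table, foreign-key index).
    violations = conn.execute("PRAGMA foreign_key_check").fetchall()
    conn.close()
    return violations


# Stand-in for generated data: employee 2 points at a missing department.
candidate = {
    "department": [(1, "eng")],
    "employee": [(1, "Ana", 1), (2, "Bo", 7)],
}
violations = find_fk_violations(candidate)
```

In a refinement loop, a report like `violations` could be passed back to the generator as feedback until it comes back empty; how SynSQL actually critiques and repairs its data is not specified here.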

📝 Abstract

Evaluation of text-to-SQL systems remains fragile: correctness is typically judged by executing predicted and gold SQL queries on a single static database, even though the same queries may behave differently under alternative database instances. This raises a broader language modeling question: Can large language models synthesize semantically meaningful, schema-consistent relational data directly from a natural language question? If so, such generation can serve as a controlled mechanism for stress-testing text-to-SQL systems beyond fixed benchmark databases. We introduce SynSQL, a framework that synthesizes test databases conditioned on question-schema alignment rather than gold SQL queries. SynSQL decomposes the task into three stages: (1) schema selection, (2) question-guided data synthesis, and (3) constraint-aware critique with iterative refinement, framing database construction as structured generation under semantic and relational constraints. Across ten text-to-SQL models on Spider, BIRD, and Spider 2.0, SynSQL-generated databases reveal performance drops of 3–14% compared to static evaluation, exposing errors masked by benchmark artifacts. We further analyze generation quality, constraint adherence, and failure modes, highlighting both the promise and limitations of LLMs in structured data synthesis. Our findings position synthetic database generation as a new lens for studying LLM reasoning, controllability, and robustness in structured environments.
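The fragility of single-database evaluation can be seen in a small, self-contained sketch. The table, the rows, and both queries below are invented for illustration and do not come from the paper or its benchmarks. For the question "who earns the highest salary?", the predicted query agrees with the gold query on one database instance but diverges on another instance where two employees tie.

```python
import sqlite3

# Hypothetical schema and queries, made up for this example.
SCHEMA = "CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, salary INTEGER)"
GOLD_SQL = "SELECT name FROM employee WHERE salary = (SELECT MAX(salary) FROM employee)"
PRED_SQL = "SELECT name FROM employee ORDER BY salary DESC LIMIT 1"


def build_db(rows):
    """Create an in-memory database instance holding the given rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany("INSERT INTO employee VALUES (?, ?, ?)", rows)
    return conn


def execution_match(conn, gold_sql, pred_sql):
    """Compare result sets order-insensitively, as execution accuracy does."""
    gold = sorted(conn.execute(gold_sql).fetchall())
    pred = sorted(conn.execute(pred_sql).fetchall())
    return gold == pred


# Instance 1: a unique maximum salary, so both queries return one name.
static_db = build_db([(1, "Ana", 120), (2, "Bo", 90)])
# Instance 2: a tie on the maximum, so LIMIT 1 silently drops a row.
alt_db = build_db([(1, "Ana", 120), (2, "Bo", 120)])

match_static = execution_match(static_db, GOLD_SQL, PRED_SQL)
match_alt = execution_match(alt_db, GOLD_SQL, PRED_SQL)
```

Here `match_static` is `True` while `match_alt` is `False`: the prediction would be scored correct on the first instance alone, and the error only surfaces once the data contains a tie.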

Problem

Research questions and friction points this paper is trying to address.

text-to-SQL

evaluation robustness

relational databases

database synthesis

benchmark artifacts

Innovation

Methods, ideas, or system contributions that make the work stand out.

SynSQL

text-to-SQL

synthetic database generation