🤖 AI Summary
Large language models (LLMs) often produce semantically correct yet structurally inconsistent SQL in Text-to-SQL tasks: surface variations in input phrasing or schema presentation can change the structure of the generated query, exposing a gap in structural reliability. This work proposes SQLStructEval, a framework that systematically evaluates the structural stability of LLM-generated SQL. By comparing normalized abstract syntax trees (ASTs), the framework reveals pervasive structural diversity among queries that share identical semantics. Building on these findings, the authors introduce a compilation-inspired structured generation approach that improves both execution accuracy and structural consistency on the Spider benchmark, pointing toward more reliable and structurally coherent SQL generation.
📝 Abstract
Despite strong performance on Text-to-SQL benchmarks, it remains unclear whether LLM-generated SQL programs are structurally reliable. In this work, we investigate the structural behavior of LLM-generated SQL queries and introduce SQLStructEval, a framework for analyzing program structures through canonical abstract syntax tree (AST) representations. Our experiments on the Spider benchmark show that modern LLMs often produce structurally diverse queries for the same input, even when execution results are correct, and that such variance is frequently triggered by surface-level input changes such as paraphrases or schema presentation. We further show that generating queries in a structured space via a compile-style pipeline can improve both execution accuracy and structural consistency. These findings suggest that structural reliability is a critical yet overlooked dimension for evaluating LLM-based program generation systems. Our code is available at https://anonymous.4open.science/r/StructEval-2435.
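To make the idea of comparing queries by structure rather than surface text concrete, here is a minimal, self-contained sketch of a structural signature for SQL. It is an illustrative stand-in, not the paper's SQLStructEval pipeline: instead of a full canonical AST, it tokenizes a query, keeps SQL keywords and operators, and collapses identifiers and literals to placeholders, so two queries with the same shape but different tables, columns, or constants map to the same signature. The keyword list and tokenizer are simplifying assumptions.

```python
import re

# A small (deliberately incomplete) keyword set for illustration.
SQL_KEYWORDS = {
    "SELECT", "FROM", "WHERE", "GROUP", "BY", "ORDER", "HAVING",
    "JOIN", "ON", "AND", "OR", "NOT", "AS", "DISTINCT", "LIMIT",
    "COUNT", "AVG", "MIN", "MAX", "SUM", "INNER", "LEFT", "ASC", "DESC",
}

def structural_signature(sql: str) -> tuple:
    """Map a query to a crude structural signature: keywords and operators
    are kept, while identifiers and literals become placeholders."""
    tokens = re.findall(r"[A-Za-z_][A-Za-z_0-9]*|\d+|'[^']*'|[(),.*<>=!]+", sql)
    sig = []
    for tok in tokens:
        up = tok.upper()
        if up in SQL_KEYWORDS:
            sig.append(up)                    # keep structural keywords
        elif tok[0].isdigit() or tok.startswith("'"):
            sig.append("<LIT>")               # collapse literals
        elif tok[0].isalpha() or tok[0] == "_":
            sig.append("<ID>")                # collapse identifiers
        else:
            sig.append(tok)                   # keep operators/punctuation
    return tuple(sig)

# Same structure, different schema terms -> identical signature:
q1 = "SELECT name FROM singer WHERE age > 30"
q2 = "SELECT title FROM album WHERE year > 2000"
# Semantically similar intent, different structure -> different signature:
q3 = "SELECT name FROM singer WHERE NOT age <= 30"

print(structural_signature(q1) == structural_signature(q2))  # True
print(structural_signature(q1) == structural_signature(q3))  # False
```

In practice, a robust version of this comparison would parse queries into real ASTs (e.g., with a SQL parser) and normalize aliases, clause ordering, and equivalent operators before comparing, which is the kind of canonicalization the abstract describes.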