🤖 AI Summary
To address the significant degradation in NL2SQL performance of open-weight large language models (LLMs) under long-context database schemas, this paper proposes an efficient, long-context-oriented data augmentation framework. The method introduces a dynamic schema expansion mechanism based on cross-database schema sampling, which extends each existing schema with realistic additional CREATE TABLE statements and representative data rows to emulate long-schema scenarios, without modifying the model architecture. It combines this synthetic schema extension with lightweight LLM fine-tuning. Evaluated on the Spider and BIRD benchmarks, the approach substantially improves SQL generation accuracy and execution correctness under long-schema conditions. Crucially, it achieves these gains without increasing inference latency or computational overhead, demonstrating its effectiveness in enhancing LLM robustness to long-context schema inputs while preserving deployment efficiency.
📝 Abstract
Open-weight large language models (LLMs) have significantly advanced performance in the Natural Language to SQL (NL2SQL) task. However, their effectiveness diminishes when dealing with large database schemas, as the context length increases. To address this limitation, we present SQLong, a novel and efficient data augmentation framework designed to enhance LLM performance in long-context scenarios for the NL2SQL task. SQLong generates augmented datasets by extending existing database schemas with additional synthetic CREATE TABLE commands and corresponding data rows, sampled from diverse schemas in the training data. This approach effectively simulates long-context scenarios during finetuning and evaluation. Through experiments on the Spider and BIRD datasets, we demonstrate that LLMs finetuned with SQLong-augmented data significantly outperform those trained on standard datasets. These results suggest that SQLong is practical to implement and improves NL2SQL capabilities in real-world settings with complex database schemas.
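The schema-extension step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `table_name` and `extend_schema`, the character budget `max_chars`, the rule for skipping duplicate table names, and the final shuffle are all assumptions made for illustration. Each schema entry is a string holding one CREATE TABLE command, which could also carry its sample data rows.

```python
import random
import re


def table_name(ddl: str) -> str:
    """Return the lower-cased table name of a CREATE TABLE statement.

    Assumes a plain, unquoted table name (illustrative simplification).
    """
    match = re.search(r"CREATE\s+TABLE\s+(\w+)", ddl, re.IGNORECASE)
    if match is None:
        raise ValueError(f"not a CREATE TABLE statement: {ddl[:40]!r}")
    return match.group(1).lower()


def extend_schema(target_db, schemas, max_chars, rng):
    """Extend one database's schema with tables sampled from other databases.

    schemas   -- dict mapping database name to a list of CREATE TABLE strings
    max_chars -- hypothetical length budget for the extended schema
    rng       -- random.Random instance, passed in for reproducibility
    """
    original = list(schemas[target_db])
    seen = {table_name(ddl) for ddl in original}

    # Pool every table that belongs to a different database.
    candidates = [
        ddl
        for db, tables in schemas.items()
        if db != target_db
        for ddl in tables
    ]
    rng.shuffle(candidates)

    extended = list(original)
    size = sum(len(ddl) for ddl in extended)
    for ddl in candidates:
        name = table_name(ddl)
        # Skip name clashes (assumption: avoids ambiguous gold SQL)
        # and tables that would exceed the budget.
        if name in seen or size + len(ddl) > max_chars:
            continue
        extended.append(ddl)
        seen.add(name)
        size += len(ddl)

    # Mix original and sampled tables so position gives no hint (assumption).
    rng.shuffle(extended)
    return extended
```

The original tables are always kept, so the gold SQL query for each training example stays valid. Only the surrounding context grows. A token-based budget tied to the model's context window would likely be a more faithful choice than the character count used here.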