SQLong: Enhanced NL2SQL for Longer Contexts with LLMs

📅 2025-02-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Open-weight large language models (LLMs) lose considerable NL2SQL accuracy when the database schema is long. To address this, the paper proposes SQLong, a data augmentation framework aimed at long-context settings. SQLong expands each database schema by sampling from the schemas of other databases, adding realistic CREATE TABLE statements and representative data rows to emulate long-schema scenarios, without modifying the model architecture. The augmented data is then used for lightweight LLM fine-tuning. Evaluated on the Spider and BIRD benchmarks, the approach substantially improves SQL generation accuracy and execution correctness under long-schema conditions. Because it changes only the training data, it achieves these gains without increasing inference latency or computational overhead, making LLMs more robust to long schema inputs while preserving deployment efficiency.

📝 Abstract

Open-weight large language models (LLMs) have significantly advanced performance in the Natural Language to SQL (NL2SQL) task. However, their effectiveness diminishes when dealing with large database schemas, because the context length increases. To address this limitation, we present SQLong, a novel and efficient data augmentation framework designed to enhance LLM performance in long-context scenarios for the NL2SQL task. SQLong generates augmented datasets by extending existing database schemas with additional synthetic CREATE TABLE commands and corresponding data rows, sampled from diverse schemas in the training data. This approach effectively simulates long-context scenarios during finetuning and evaluation. Through experiments on the Spider and BIRD datasets, we demonstrate that LLMs finetuned with SQLong-augmented data significantly outperform those trained on standard datasets. These results demonstrate SQLong's practicality and its impact on improving NL2SQL capabilities in real-world settings with complex database schemas.
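The schema-extension step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `augment_schema`, the character-count budget `max_chars` (a crude stand-in for a token limit), the name-collision rule, and the final shuffle are all assumptions made for illustration. Sample data rows, which the abstract also mentions, could be appended to each statement string in the same way.

```python
import random


def augment_schema(target_tables, pool_tables, max_chars, rng):
    """Pad a target schema with CREATE TABLE statements from other databases.

    target_tables, pool_tables: lists of (table_name, create_statement) pairs.
    max_chars: hypothetical context budget, measured in characters.
    rng: a random.Random instance, passed in so the result is reproducible.
    """
    used_names = {name.lower() for name, _ in target_tables}
    combined = list(target_tables)
    size = sum(len(stmt) for _, stmt in combined)

    candidates = list(pool_tables)
    rng.shuffle(candidates)
    for name, stmt in candidates:
        # Assumption: skip sampled tables whose name clashes with one
        # already present, so the gold SQL query stays unambiguous.
        if name.lower() in used_names:
            continue
        # Skip any table that would push the schema over the budget.
        if size + len(stmt) > max_chars:
            continue
        combined.append((name, stmt))
        used_names.add(name.lower())
        size += len(stmt)

    # Assumption: shuffle so the original tables are not always listed first.
    rng.shuffle(combined)
    return "\n\n".join(stmt for _, stmt in combined)


target = [("singer", "CREATE TABLE singer (id INT PRIMARY KEY, name TEXT);")]
pool = [
    ("orders", "CREATE TABLE orders (id INT PRIMARY KEY, total REAL);"),
    ("singer", "CREATE TABLE singer (sid INT, country TEXT);"),
    ("airports", "CREATE TABLE airports (code TEXT PRIMARY KEY, city TEXT);"),
]
long_schema = augment_schema(target, pool, max_chars=500, rng=random.Random(0))
print(long_schema)
```

The question and gold SQL of the training example stay unchanged; only the schema portion of the prompt grows, which is what lets the fine-tuned model practice locating the relevant tables inside a longer context.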

Problem

Research questions and friction points this paper is trying to address.

Enhance NL2SQL for long contexts

Address performance drop with large schemas

Improve LLM effectiveness in real-world settings

Innovation

Methods, ideas, or system contributions that make the work stand out.

LLMs finetuned for long-context NL2SQL

Data augmentation for long contexts

Synthetic schema extension
