🤖 AI Summary
To address the dual requirements of high accuracy and efficiency in NL2SQL for real-world business applications—and the inherent trade-off between domain customization and lightweight deployment—this paper proposes Distill-C, a "distilled customization" framework that integrates knowledge distillation from large language models (LLMs) with client- and domain-specific adaptation. Specifically, a teacher LLM generates high-quality synthetic NL2SQL data, which undergoes multi-stage filtering before being used for instruction fine-tuning and domain-aware prompting to empower lightweight open-source LLMs. Evaluated on multiple public benchmarks, the approach achieves an average 36% gain in execution accuracy over base models from three distinct LLM families; on three internal customer benchmarks, it improves accuracy by 22.6%, all while maintaining low computational cost. Notably, the compact student models rival or even surpass teacher models an order of magnitude larger in NL2SQL performance.
📝 Abstract
The growing adoption of large language models (LLMs) in business applications has amplified interest in Natural Language to SQL (NL2SQL) solutions, where high performance and efficiency are competing demands. Domain- and customer-specific requirements further complicate the problem. To address this conundrum, we introduce Distill-C, a distilled customization framework tailored for NL2SQL tasks. Distill-C utilizes large teacher LLMs to produce high-quality synthetic data through a robust and scalable pipeline. Fine-tuning smaller, open-source LLMs on this synthesized data enables them to rival or outperform teacher models an order of magnitude larger. Evaluated on multiple challenging benchmarks, Distill-C achieves an average improvement of 36% in execution accuracy over the base models from three distinct LLM families. Additionally, on three internal customer benchmarks, Distill-C demonstrates a 22.6% performance improvement over the base models. Our results show that Distill-C is an effective, high-performing, and generalizable approach for deploying lightweight yet powerful NL2SQL models, delivering exceptional accuracy while maintaining low computational cost.
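The abstract's "robust and scalable pipeline" filters teacher-generated synthetic data before fine-tuning. A minimal sketch of one plausible filtering stage is shown below: keeping only synthetic (question, SQL) pairs whose SQL actually executes against the target schema. All names, data, and the specific filtering criterion here are illustrative assumptions, not the paper's actual pipeline.

```python
import sqlite3

# Hypothetical synthetic (question, SQL) pairs, as a teacher LLM might emit them.
SYNTHETIC_PAIRS = [
    ("How many customers are there?", "SELECT COUNT(*) FROM customers"),
    ("List customer names", "SELECT name FROM customers"),
    ("Broken query", "SELEC name FROM customers"),       # syntax error
    ("Unknown column", "SELECT age FROM customers"),     # schema mismatch
]

def execution_filter(pairs, schema_sql):
    """Keep only pairs whose SQL executes against the client schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(schema_sql)
    kept = []
    for question, sql in pairs:
        try:
            conn.execute(sql)          # dry-run against an in-memory copy
            kept.append((question, sql))
        except sqlite3.Error:
            pass                        # drop unexecutable synthetic data
    conn.close()
    return kept

schema = "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);"
filtered = execution_filter(SYNTHETIC_PAIRS, schema)
# Only the two executable pairs survive this stage.
```

In a multi-stage setup, further filters (e.g., semantic checks or deduplication) would follow before the surviving pairs feed instruction fine-tuning of the student model.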