🤖 AI Summary
To address the dual requirements of high accuracy and efficiency in NL2SQL for real-world business applications—and the inherent trade-off between domain customization and lightweight deployment—this paper proposes Distill-C, a "distilled customization" framework that integrates knowledge distillation from large language models (LLMs) with client- and domain-specific adaptation. Specifically, a teacher LLM generates high-quality synthetic NL2SQL data, which undergoes multi-stage filtering before being used for instruction fine-tuning and domain-aware prompting to empower lightweight open-source LLMs. Evaluated on multiple public benchmarks, the approach achieves an average 36% gain in execution accuracy over base models from three distinct LLM families; on three internal customer benchmarks, it improves accuracy by 22.6%, all while maintaining low computational cost. Notably, the compact student models rival or even surpass teacher models an order of magnitude larger in NL2SQL performance.
📝 Abstract
The growing adoption of large language models (LLMs) in business applications has amplified interest in Natural Language to SQL (NL2SQL) solutions, where high performance and efficiency are competing demands. Domain- and customer-specific requirements further complicate the problem. To address this conundrum, we introduce Distill-C, a distilled customization framework tailored for NL2SQL tasks. Distill-C utilizes large teacher LLMs to produce high-quality synthetic data through a robust and scalable pipeline. Fine-tuning smaller, open-source LLMs on this synthesized data enables them to rival or outperform teacher models an order of magnitude larger. Evaluated on multiple challenging benchmarks, Distill-C achieves an average improvement of 36% in execution accuracy over the base models from three distinct LLM families. Additionally, on three internal customer benchmarks, Distill-C demonstrates a 22.6% performance improvement over the base models. Our results show that Distill-C is an effective, high-performing, and generalizable approach for deploying lightweight yet powerful NL2SQL models, delivering exceptional accuracy while maintaining low computational cost.
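The abstract's "robust and scalable pipeline" filters teacher-generated synthetic data before fine-tuning. A minimal sketch of one plausible filtering stage is shown below: keeping only synthetic (question, SQL) pairs whose SQL actually executes against the target schema. All names, data, and the specific filtering criterion here are illustrative assumptions, not the paper's actual pipeline.

```python
import sqlite3

# Hypothetical synthetic (question, SQL) pairs, as a teacher LLM might emit them.
SYNTHETIC_PAIRS = [
    ("How many customers are there?", "SELECT COUNT(*) FROM customers"),
    ("List customer names", "SELECT name FROM customers"),
    ("Broken query", "SELEC name FROM customers"),       # syntax error
    ("Unknown column", "SELECT age FROM customers"),     # schema mismatch
]

def execution_filter(pairs, schema_sql):
    """Keep only pairs whose SQL executes against the client schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(schema_sql)
    kept = []
    for question, sql in pairs:
        try:
            conn.execute(sql)          # dry-run against an in-memory copy
            kept.append((question, sql))
        except sqlite3.Error:
            pass                        # drop unexecutable synthetic data
    conn.close()
    return kept

schema = "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);"
filtered = execution_filter(SYNTHETIC_PAIRS, schema)
# Only the two executable pairs survive this stage.
```

In a multi-stage setup, further filters (e.g., semantic checks or deduplication) would follow before the surviving pairs feed instruction fine-tuning of the student model.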