🤖 AI Summary
This work investigates the impact of reasoning capability on small-scale LLMs in Text-to-SQL tasks, particularly those involving multi-table joins and multi-hop reasoning. To address the limitations of zero-shot learning (ZSL) and supervised fine-tuning (SFT) on complex reasoning, the authors propose a two-stage training paradigm: first, SFT guided by step-by-step reasoning trajectories; second, reinforcement learning (RL) that uses execution accuracy as the reward signal, thereby balancing the generalizability of reasoning with the fidelity of SQL generation. The work presents the first systematic quantification of the correlation between reasoning ability and Text-to-SQL performance. Evaluated on four benchmarks including Bird, the 7B-parameter Qwen-Coder-2.5 model trained with this method achieves substantial gains over both ZSL and SFT-only baselines, and matches the performance of 100+ billion-parameter models on multi-table and multi-hop queries.
📝 Abstract
Large Language Models (LLMs) have shown impressive capabilities in transforming natural language questions about relational databases into SQL queries. Despite recent improvements, small LLMs struggle to handle questions involving multiple tables and complex SQL patterns under a Zero-Shot Learning (ZSL) setting. Supervised Fine-Tuning (SFT) partially compensates for the knowledge deficits in pretrained models but falls short when dealing with queries involving multi-hop reasoning. To bridge this gap, different LLM training strategies to reinforce reasoning capabilities have been proposed, ranging from leveraging a thinking process within ZSL, to including reasoning traces in SFT, to adopting Reinforcement Learning (RL) strategies. However, the influence of reasoning on Text2SQL performance is still largely unexplored. This paper investigates to what extent LLM reasoning capabilities influence their Text2SQL performance on four benchmark datasets. To this end, it considers the following LLM settings: (1) ZSL, with and without general-purpose reasoning; (2) SFT, with and without task-specific reasoning traces; (3) RL, leveraging execution accuracy as the primary reward function; (4) SFT+RL, i.e., a two-stage approach that combines SFT and RL. The results show that general-purpose reasoning under ZSL proves to be ineffective in tackling complex Text2SQL cases. Small LLMs benefit from SFT with reasoning much more than larger ones, bridging the gap left by their (weaker) model pretraining. RL is generally beneficial across all tested models and datasets, particularly when SQL queries involve multi-hop reasoning and multiple tables. Small LLMs with SFT+RL excel on the most complex datasets thanks to a strategic balance between the generality of the reasoning process and the optimization of execution accuracy. Thanks to RL, the 7B Qwen-Coder-2.5 model performs on par with 100+ billion-parameter ones on the Bird dataset.
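To make the RL setting concrete: an execution-accuracy reward compares the result set of the generated query against that of the gold query on the actual database. The sketch below is illustrative only (the function name and the SQLite-based, order-insensitive row comparison are assumptions, not details from the paper):

```python
import sqlite3


def execution_accuracy_reward(pred_sql: str, gold_sql: str, db_path: str) -> float:
    """Reward 1.0 if the predicted query returns the same multiset of rows
    as the gold query on the given database, else 0.0.

    Rows are compared order-insensitively, since two semantically
    equivalent queries may return rows in different orders.
    """
    conn = sqlite3.connect(db_path)
    try:
        try:
            pred_rows = conn.execute(pred_sql).fetchall()
        except sqlite3.Error:
            return 0.0  # invalid or failing SQL earns no reward
        gold_rows = conn.execute(gold_sql).fetchall()
        same = sorted(map(repr, pred_rows)) == sorted(map(repr, gold_rows))
        return 1.0 if same else 0.0
    finally:
        conn.close()
```

A sparse binary reward like this is a common choice for Text2SQL RL because it directly optimizes the evaluation metric, at the cost of giving no partial credit for near-miss queries.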