🤖 AI Summary
This work investigates the impact of reasoning capability on small-scale LLMs in Text-to-SQL tasks, particularly those involving multi-table joins and multi-hop reasoning. To address the limitations of zero-shot learning (ZSL) and supervised fine-tuning (SFT) on complex reasoning, the authors propose a two-stage training paradigm: first, SFT guided by step-by-step reasoning trajectories; second, reinforcement learning (RL) that uses execution accuracy as the reward signal, thereby balancing the generalizability of reasoning with the fidelity of SQL generation. The work presents the first systematic quantification of the correlation between reasoning ability and Text-to-SQL performance. Evaluated on four benchmarks including Bird, the 7B-parameter Qwen-Coder-2.5 model trained with this method achieves substantial gains over both ZSL and SFT-only baselines, and matches the performance of 100+ billion-parameter models on multi-table and multi-hop queries.
📝 Abstract
Large Language Models (LLMs) have shown impressive capabilities in transforming natural language questions about relational databases into SQL queries. Despite recent improvements, small LLMs struggle to handle questions involving multiple tables and complex SQL patterns under a Zero-Shot Learning (ZSL) setting. Supervised Fine-Tuning (SFT) partially compensates for the knowledge deficits in pretrained models but falls short when dealing with queries involving multi-hop reasoning. To bridge this gap, different LLM training strategies to reinforce reasoning capabilities have been proposed, ranging from leveraging a thinking process within ZSL, to including reasoning traces in SFT, to adopting Reinforcement Learning (RL) strategies. However, the influence of reasoning on Text2SQL performance is still largely unexplored. This paper investigates to what extent LLM reasoning capabilities influence their Text2SQL performance on four benchmark datasets. To this end, it considers the following LLM settings: (1) ZSL, with and without general-purpose reasoning; (2) SFT, with and without task-specific reasoning traces; (3) RL, leveraging execution accuracy as the primary reward function; (4) SFT+RL, i.e., a two-stage approach that combines SFT and RL. The results show that general-purpose reasoning under ZSL proves to be ineffective in tackling complex Text2SQL cases. Small LLMs benefit from SFT with reasoning much more than larger ones, bridging the gap left by their (weaker) model pretraining. RL is generally beneficial across all tested models and datasets, particularly when SQL queries involve multi-hop reasoning and multiple tables. Small LLMs with SFT+RL excel on the most complex datasets thanks to a strategic balance between the generality of the reasoning process and the optimization of execution accuracy. Thanks to RL, the 7B Qwen-Coder-2.5 model performs on par with 100+ billion-parameter ones on the Bird dataset.
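To make the RL setting concrete: an execution-accuracy reward compares the result set of the generated query against that of the gold query on the actual database. The sketch below is illustrative only (the function name and the SQLite-based, order-insensitive row comparison are assumptions, not details from the paper):

```python
import sqlite3


def execution_accuracy_reward(pred_sql: str, gold_sql: str, db_path: str) -> float:
    """Reward 1.0 if the predicted query returns the same multiset of rows
    as the gold query on the given database, else 0.0.

    Rows are compared order-insensitively, since two semantically
    equivalent queries may return rows in different orders.
    """
    conn = sqlite3.connect(db_path)
    try:
        try:
            pred_rows = conn.execute(pred_sql).fetchall()
        except sqlite3.Error:
            return 0.0  # invalid or failing SQL earns no reward
        gold_rows = conn.execute(gold_sql).fetchall()
        same = sorted(map(repr, pred_rows)) == sorted(map(repr, gold_rows))
        return 1.0 if same else 0.0
    finally:
        conn.close()
```

A sparse binary reward like this is a common choice for Text2SQL RL because it directly optimizes the evaluation metric, at the cost of giving no partial credit for near-miss queries.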