🤖 AI Summary
To address the insufficient reasoning accuracy of large language models (LLMs) in Text-to-SQL tasks, this paper proposes a Chain-of-Thought (CoT)-enhanced Direct Preference Optimization (DPO) framework that relies solely on SQL execution feedback, eliminating the need for reward modeling or human annotations. Methodologically, the authors introduce the first execution-guided CoT-DPO paradigm, enabling end-to-end differentiable optimization of reasoning trajectories. The approach unifies CoT-based inference with offline and online DPO, facilitating scalable, preference-driven refinement of SQL generation grounded in executable outcomes. Evaluated on the BIRD and Spider benchmarks, the method substantially outperforms existing zero-shot CoT and non-CoT DPO baselines: LLaMA-3 70B achieves 68.51% and 68.53% execution accuracy on the BIRD dev and test sets, respectively (up from 57.37%), and attains 86.59% on Spider's test set, establishing new single-model state-of-the-art performance.
📝 Abstract
Text-to-SQL demands precise reasoning to convert natural language questions into structured queries. While large language models (LLMs) excel in many reasoning tasks, their ability to leverage Chain-of-Thought (CoT) reasoning for text-to-SQL remains underexplored. We identify critical limitations: zero-shot CoT offers minimal gains, and Direct Preference Optimization (DPO) applied without CoT yields marginal improvements. We propose ExCoT, a novel framework that iteratively optimizes open-source LLMs by combining CoT reasoning with off-policy and on-policy DPO, relying solely on execution accuracy as feedback. This approach eliminates the need for reward models or human-annotated preferences. Our experimental results demonstrate significant performance gains: ExCoT improves execution accuracy on BIRD dev set from 57.37% to 68.51% and on Spider test set from 78.81% to 86.59% for LLaMA-3 70B, with Qwen-2.5-Coder demonstrating similar improvements. Our best model achieves state-of-the-art performance in the single-model setting on both BIRD and Spider datasets, notably achieving 68.53% on the BIRD test set.
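The abstract describes using SQL execution accuracy as the sole preference signal for DPO, with no reward model or human labels. A minimal sketch of how such preference pairs might be constructed from execution feedback is shown below; the schema, function names, and candidate generations are illustrative assumptions, not taken from the paper:

```python
import sqlite3

def execute_sql(conn, sql):
    """Run a SQL query and return its result rows, or None on any error."""
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None

def build_preference_pairs(conn, gold_sql, candidates):
    """Split (CoT, SQL) candidates by whether their execution result matches
    the gold query's result, then pair each correct candidate (chosen) with
    each incorrect one (rejected) to form DPO training pairs."""
    gold = execute_sql(conn, gold_sql)
    correct, incorrect = [], []
    for cot, sql in candidates:
        result = execute_sql(conn, sql)
        (correct if result is not None and result == gold else incorrect).append((cot, sql))
    return [(chosen, rejected) for chosen in correct for rejected in incorrect]

# Toy database and sampled generations (purely illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, age INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, 30), (2, 17)])

gold_sql = "SELECT COUNT(*) FROM users WHERE age >= 18"
candidates = [
    ("Filter rows with age >= 18, then count.",
     "SELECT COUNT(*) FROM users WHERE age >= 18"),
    ("Count all rows.", "SELECT COUNT(*) FROM users"),        # wrong result
    ("Malformed query.", "SELECT COUNT(* FROM users"),         # execution error
]

pairs = build_preference_pairs(conn, gold_sql, candidates)
print(len(pairs))  # 1 correct candidate x 2 incorrect = 2 pairs
```

In this setup, a correct-versus-incorrect pair carries the full reasoning trace (CoT plus SQL), so preference optimization refines the reasoning trajectory, not just the final query string.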