🤖 AI Summary
To address the insufficient reasoning accuracy of large language models (LLMs) in Text-to-SQL tasks, this paper proposes a Chain-of-Thought (CoT)-enhanced Direct Preference Optimization (DPO) framework that relies solely on SQL execution feedback, eliminating the need for reward modeling or human annotations. Methodologically, the authors introduce the first execution-guided CoT-DPO paradigm, enabling end-to-end differentiable optimization of reasoning trajectories. The approach unifies CoT-based inference with offline and online DPO, facilitating scalable, preference-driven refinement of SQL generation grounded in executable outcomes. Evaluated on the BIRD and Spider benchmarks, the method substantially outperforms existing zero-shot CoT and non-CoT DPO baselines: LLaMA-3 70B achieves 68.51% and 68.53% execution accuracy on the BIRD dev and test sets, respectively (up from 57.37%), and attains 86.59% on Spider's test set, establishing new single-model state-of-the-art performance.
📝 Abstract
Text-to-SQL demands precise reasoning to convert natural language questions into structured queries. While large language models (LLMs) excel in many reasoning tasks, their ability to leverage Chain-of-Thought (CoT) reasoning for text-to-SQL remains underexplored. We identify critical limitations: zero-shot CoT offers minimal gains, and Direct Preference Optimization (DPO) applied without CoT yields marginal improvements. We propose ExCoT, a novel framework that iteratively optimizes open-source LLMs by combining CoT reasoning with off-policy and on-policy DPO, relying solely on execution accuracy as feedback. This approach eliminates the need for reward models or human-annotated preferences. Our experimental results demonstrate significant performance gains: ExCoT improves execution accuracy on BIRD dev set from 57.37% to 68.51% and on Spider test set from 78.81% to 86.59% for LLaMA-3 70B, with Qwen-2.5-Coder demonstrating similar improvements. Our best model achieves state-of-the-art performance in the single-model setting on both BIRD and Spider datasets, notably achieving 68.53% on the BIRD test set.
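The abstract describes using SQL execution accuracy as the sole preference signal for DPO, with no reward model or human labels. A minimal sketch of how such preference pairs might be constructed from execution feedback is shown below; the schema, function names, and candidate generations are illustrative assumptions, not taken from the paper:

```python
import sqlite3

def execute_sql(conn, sql):
    """Run a SQL query and return its result rows, or None on any error."""
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None

def build_preference_pairs(conn, gold_sql, candidates):
    """Split (CoT, SQL) candidates by whether their execution result matches
    the gold query's result, then pair each correct candidate (chosen) with
    each incorrect one (rejected) to form DPO training pairs."""
    gold = execute_sql(conn, gold_sql)
    correct, incorrect = [], []
    for cot, sql in candidates:
        result = execute_sql(conn, sql)
        (correct if result is not None and result == gold else incorrect).append((cot, sql))
    return [(chosen, rejected) for chosen in correct for rejected in incorrect]

# Toy database and sampled generations (purely illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, age INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, 30), (2, 17)])

gold_sql = "SELECT COUNT(*) FROM users WHERE age >= 18"
candidates = [
    ("Filter rows with age >= 18, then count.",
     "SELECT COUNT(*) FROM users WHERE age >= 18"),
    ("Count all rows.", "SELECT COUNT(*) FROM users"),        # wrong result
    ("Malformed query.", "SELECT COUNT(* FROM users"),         # execution error
]

pairs = build_preference_pairs(conn, gold_sql, candidates)
print(len(pairs))  # 1 correct candidate x 2 incorrect = 2 pairs
```

In this setup, a correct-versus-incorrect pair carries the full reasoning trace (CoT plus SQL), so preference optimization refines the reasoning trajectory, not just the final query string.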