Reward-SQL: Boosting Text-to-SQL via Stepwise Reasoning and Process-Supervised Rewards

📅 2025-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Process reward models (PRMs) can distort the reasoning trajectory in Text-to-SQL when they are misused. To address this, we propose Reward-SQL, a staged framework that follows a "cold start, then PRM supervision" paradigm. First, the model is trained to decompose SQL queries into interpretable stepwise reasoning chains built from common table expressions (Chain-of-CTEs); then PRMs are introduced to give process-level feedback. Of the four integration strategies studied, combining PRM as an online training signal (GRPO) with PRM-guided best-of-N sampling at inference works best. On the BIRD benchmark, models supervised by a 7B PRM gain 13.1% across guidance strategies, and the GRPO-aligned policy based on Qwen2.5-Coder-7B-Instruct reaches 68.9% execution accuracy on the development set, outperforming all baselines of the same model size.

📝 Abstract

Recent advances in large language models (LLMs) have significantly improved performance on the Text-to-SQL task by leveraging their powerful reasoning capabilities. To enhance accuracy during the reasoning process, external Process Reward Models (PRMs) can be introduced during training and inference to provide fine-grained supervision. However, if misused, PRMs may distort the reasoning trajectory and lead to suboptimal or incorrect SQL generation. To address this challenge, we propose Reward-SQL, a framework that systematically explores how to incorporate PRMs into the Text-to-SQL reasoning process effectively. Our approach follows a "cold start, then PRM supervision" paradigm. Specifically, we first train the model to decompose SQL queries into structured stepwise reasoning chains using common table expressions (Chain-of-CTEs), establishing a strong and interpretable reasoning baseline. Then, we investigate four strategies for integrating PRMs, and find that combining PRM as an online training signal (GRPO) with PRM-guided inference (e.g., best-of-N sampling) yields the best results. Empirically, on the BIRD benchmark, Reward-SQL enables models supervised by a 7B PRM to achieve a 13.1% performance gain across various guidance strategies. Notably, our GRPO-aligned policy model based on Qwen2.5-Coder-7B-Instruct achieves 68.9% accuracy on the BIRD development set, outperforming all baseline methods under the same model size. These results demonstrate the effectiveness of Reward-SQL in leveraging reward-based supervision for Text-to-SQL reasoning. Our code is publicly available.
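To make the Chain-of-CTEs idea concrete, here is a minimal sketch of a query decomposed into named CTE steps, with every prefix of the chain executed so that each intermediate result can be inspected. The table, data, step names, and helper functions (`build_query`, `run_prefixes`) are hypothetical and invented for illustration; this is not the paper's implementation.

```python
import sqlite3

# Hypothetical toy schema and data, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, amount REAL);
    INSERT INTO orders VALUES ('ann', 30), ('ann', 50), ('bob', 20), ('cy', 90);
""")

# Each reasoning step is one named CTE that builds on the previous ones.
steps = [
    ("totals", "SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer"),
    ("big_spenders", "SELECT customer, total FROM totals WHERE total > 50"),
]
final_select = "SELECT customer FROM big_spenders ORDER BY total DESC"


def build_query(ctes, select):
    """Assemble a WITH query from (name, body) pairs and a final SELECT."""
    with_clause = ",\n".join(f"{name} AS ({body})" for name, body in ctes)
    return f"WITH {with_clause}\n{select}"


def run_prefixes(conn, ctes):
    """Execute every prefix of the chain so each step's result can be inspected."""
    results = {}
    for i, (name, _) in enumerate(ctes, start=1):
        query = build_query(ctes[:i], f"SELECT * FROM {name}")
        results[name] = conn.execute(query).fetchall()
    return results


step_results = run_prefixes(conn, steps)
final_rows = conn.execute(build_query(steps, final_select)).fetchall()
```

Because each step is an executable query on its own, a step-level scorer such as a PRM has a well-defined unit to evaluate, which is one plausible reason a CTE-based decomposition suits process supervision.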

Problem

Research questions and friction points this paper is trying to address.

Enhancing Text-to-SQL accuracy via stepwise reasoning

Mitigating misuse of Process Reward Models (PRMs) in SQL generation

Optimizing PRM integration strategies for better SQL outputs
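One of the integration strategies the abstract names is PRM-guided best-of-N sampling at inference. The sketch below shows the general selection pattern under stated assumptions: the per-step scores come from a stand-in lookup table rather than a real PRM, and collapsing step scores with the minimum is an illustrative choice, not necessarily the aggregation used in the paper.

```python
from typing import Callable, Sequence, TypeVar

C = TypeVar("C")


def aggregate_step_scores(step_scores: Sequence[float]) -> float:
    """Collapse per-step scores into one value (assumption: weakest step decides)."""
    return min(step_scores)


def best_of_n(candidates: Sequence[C], score_steps: Callable[[C], Sequence[float]]) -> C:
    """Return the candidate with the highest aggregated score; earliest wins ties."""
    scored = [
        (aggregate_step_scores(score_steps(cand)), -idx)
        for idx, cand in enumerate(candidates)
    ]
    _, neg_idx = max(scored)
    return candidates[-neg_idx]


# Stand-in for a PRM: fixed, made-up step scores for three sampled candidates.
toy_scores = {
    "cand_a": [0.90, 0.40, 0.80],
    "cand_b": [0.70, 0.75, 0.80],
    "cand_c": [0.95, 0.90, 0.20],
}


def toy_prm(candidate_id: str) -> Sequence[float]:
    """Hypothetical scorer that looks up precomputed step scores."""
    return toy_scores[candidate_id]


best = best_of_n(list(toy_scores), toy_prm)
```

With the minimum as the aggregate, `cand_b` is selected because its weakest step (0.70) beats the weakest steps of the other two candidates (0.40 and 0.20).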

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses stepwise reasoning with Chain-of-CTEs

Integrates Process Reward Models (PRMs) effectively

Combines PRM online training with guided inference
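For the online training side, GRPO scores each sampled response relative to the other responses drawn for the same prompt, normalizing rewards by the group's mean and standard deviation. The sketch below shows only that normalization step. The reward values are hypothetical scalars; how Reward-SQL combines PRM step scores with outcome signals into a single reward is not stated in this summary and is not modeled here.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled from the same question.
group_rewards = [1.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(group_rewards)

# A group where every response earns the same reward yields zero advantages,
# so it contributes no learning signal.
flat_advantages = group_relative_advantages([0.5, 0.5, 0.5])
```

The small `eps` term is an assumption added to avoid division by zero when all rewards in a group are identical.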

🔎 Similar Papers

A Survey on Employing Large Language Models for Text-to-SQL Tasks