🤖 AI Summary
Existing Text-to-SQL approaches rely on static execution feedback and lack real-time error correction capability. To address this, we propose a multi-round tool-integrated reasoning reinforcement learning framework that jointly incorporates dynamic database interaction and context-aware progressive query refinement, introducing an execution-aware multi-turn reasoning paradigm. We enhance the GRPO algorithm by removing the KL-divergence constraint and designing a trajectory filtering mechanism to improve training stability and policy distribution consistency. Our method achieves 64.4% and 84.6% execution accuracy on BIRD Dev and SPIDER Dev, respectively, substantially outperforming prior state-of-the-art methods. The core contribution lies in being the first to deeply integrate dynamic execution feedback into end-to-end differentiable, error-correctable multi-round RL training for Text-to-SQL generation.
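The execution-aware multi-turn paradigm can be illustrated with a minimal sketch. It is not the authors' implementation: `execute_sql`, `multi_turn_refine`, and `toy_propose` are hypothetical names, `toy_propose` is a hard-coded stand-in for the policy model, and an in-memory SQLite database stands in for the benchmark databases. The loop runs each candidate query, feeds the result or error back into the context, and stops once a query executes.

```python
import sqlite3
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # (sql, observation) pairs fed back to the model


def execute_sql(conn: sqlite3.Connection, sql: str, max_rows: int = 5) -> str:
    """Run a query and return a short textual observation (rows or error)."""
    try:
        rows = conn.execute(sql).fetchmany(max_rows)
        return f"OK: {rows}"
    except sqlite3.Error as exc:
        return f"ERROR: {exc}"


def multi_turn_refine(
    conn: sqlite3.Connection,
    question: str,
    propose: Callable[[str, History], str],
    max_turns: int = 3,
) -> Tuple[str, History]:
    """Propose a query, execute it, and retry with the feedback in context."""
    history: History = []
    sql = ""
    for _ in range(max_turns):
        sql = propose(question, history)
        observation = execute_sql(conn, sql)
        history.append((sql, observation))
        if observation.startswith("OK"):
            break  # assumption: stop at the first query that executes
    return sql, history


def toy_propose(question: str, history: History) -> str:
    """Hypothetical stand-in for the model: first a typo, then a fix."""
    if not history:
        return "SELECT nme FROM singer WHERE age > 30"
    return "SELECT name FROM singer WHERE age > 30"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO singer VALUES (?, ?)", [("Ana", 41), ("Bo", 25)])

final_sql, history = multi_turn_refine(
    conn, "Which singers are older than 30?", toy_propose
)
for sql, observation in history:
    print(sql, "->", observation)
```

In this toy run the first query fails on the misspelled column, and the error message becomes part of the context for the second attempt. A query that executes is not necessarily correct, so a real system would need a richer stopping rule than the one assumed here.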
📝 Abstract
As large language models (LLMs) are increasingly used in Text-to-SQL tasks, Reinforcement Learning (RL) has become a common method for improving performance. Existing methods primarily rely on static execution feedback, which restricts real-time error correction. Integrating multi-turn tool invocation with dynamic feedback could significantly improve adaptability and robustness, and ultimately model performance. To address this limitation, we propose MTIR-SQL, an innovative Multi-turn Tool-Integrated Reasoning reinforcement learning framework for Text-to-SQL. Our approach introduces an execution-aware multi-turn reasoning paradigm that incorporates database execution feedback at each reasoning step, enabling context-sensitive query generation and progressive refinement throughout the reasoning process. The framework extends the GRPO algorithm to accommodate complex multi-turn interaction scenarios. Because MTIR training is unstable and the model distribution can deviate significantly from that of the initial model, we enhance the GRPO algorithm by adding a trajectory filtering mechanism and removing the KL loss constraint. Experimental results demonstrate that MTIR-SQL, with 4B parameters, achieves 64.4% execution accuracy on BIRD Dev and 84.6% execution accuracy on SPIDER Dev, significantly outperforming existing approaches.
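As a rough illustration of how trajectory filtering could interact with GRPO's group-relative advantages, the sketch below normalizes rewards within one group of sampled trajectories after dropping filtered ones. The abstract does not state the filtering criterion, so the `valid` flags (for example, marking rollouts with malformed tool calls) are an assumption, as are the function name `group_advantages` and the choice to give filtered trajectories zero advantage. No KL term appears because the sketch only computes advantages, not the full loss.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(
    rewards: List[float], valid: List[bool], eps: float = 1e-6
) -> List[float]:
    """Standardize rewards within one group, ignoring filtered trajectories.

    Filtered trajectories receive zero advantage, so they contribute no
    gradient (an assumed design, not taken from the paper).
    """
    kept = [r for r, ok in zip(rewards, valid) if ok]
    if len(kept) < 2:
        return [0.0] * len(rewards)  # too few samples to normalize
    mu = mean(kept)
    sigma = pstdev(kept)
    return [
        (r - mu) / (sigma + eps) if ok else 0.0
        for r, ok in zip(rewards, valid)
    ]


# One group of four rollouts for the same question; the third is filtered out.
rewards = [1.0, 0.0, 1.0, 0.0]
valid = [True, True, False, True]
adv = group_advantages(rewards, valid)
print(adv)
```

Computing the mean and standard deviation over kept trajectories only means a filtered rollout cannot shift the baseline for the rest of its group.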