🤖 AI Summary
Existing text-to-SQL reinforcement learning (RL) approaches rely either on costly SQL execution against databases or on large language model (LLM)-based scoring, incurring high latency and substantial GPU memory overhead. This work proposes Graph-Reward-SQL, an execution-free RL fine-tuning framework for text-to-SQL. Its core contributions are: (1) GMNScore, an outcome reward model that leverages graph-structured matching of SQL queries to capture semantic and syntactic fidelity without execution; and (2) StepRTM, a stepwise reward model that supplies intermediate supervision over Common Table Expression (CTE) subqueries. Crucially, the framework eliminates dependence on both database execution and LLM scoring during training. Evaluated on the Spider and BIRD benchmarks, it achieves 4.2–6.8% absolute gains in SQL execution accuracy, reduces inference latency by 72%, and cuts GPU memory consumption by 65%, outperforming both execution-based and LLM-based reward methods.
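The graph-matching idea behind GMNScore can be illustrated with a deliberately simplified sketch. The paper uses a learned Graph Matching Network; below, a hand-built clause/identifier edge set and Jaccard overlap stand in for the learned embedding similarity. The parser and scoring function are illustrative assumptions, not the authors' implementation:

```python
import re

def sql_to_graph(sql: str) -> set:
    """Crude stand-in for a SQL graph: edges from each clause keyword
    to the identifiers appearing before the next keyword.
    (Illustrative only -- a real system would parse a proper AST.)"""
    keywords = ("SELECT", "FROM", "WHERE", "GROUP BY", "ORDER BY", "JOIN")
    pattern = "|".join(k.replace(" ", r"\s+") for k in keywords)
    edges = set()
    current = None
    for part in re.split(f"({pattern})", sql.upper()):
        norm = re.sub(r"\s+", " ", part.strip())
        if norm in keywords:
            current = norm
        elif current:
            for ident in re.findall(r"[A-Z_][A-Z0-9_.]*", part):
                edges.add((current, ident))
    return edges

def gmn_score_proxy(pred_sql: str, gold_sql: str) -> float:
    """Execution-free reward: edge-set overlap between predicted and
    reference SQL graphs. A learned GMN embedding similarity would
    replace this Jaccard proxy."""
    g1, g2 = sql_to_graph(pred_sql), sql_to_graph(gold_sql)
    if not g1 and not g2:
        return 1.0
    return len(g1 & g2) / len(g1 | g2)
```

Because the reward is computed from query structure alone, no database connection or reward-LLM forward pass is needed, which is the source of the latency and memory savings the summary reports.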
📝 Abstract
Reinforcement learning (RL) has been widely adopted to enhance the performance of large language models (LLMs) on Text-to-SQL tasks. However, existing methods often rely on execution-based or LLM-based Bradley-Terry reward models. The former suffers from high execution latency caused by repeated database calls, whereas the latter imposes substantial GPU memory overhead, both of which significantly hinder the efficiency and scalability of RL pipelines. To this end, we propose a novel Text-to-SQL RL fine-tuning framework named Graph-Reward-SQL, which employs the GMNScore outcome reward model. We leverage SQL graph representations to provide accurate reward signals while significantly reducing inference time and GPU memory usage. Building on this foundation, we further introduce StepRTM, a stepwise reward model that provides intermediate supervision over Common Table Expression (CTE) subqueries. This encourages both functional correctness and structural clarity of SQL. Extensive comparative and ablation experiments on standard benchmarks, including Spider and BIRD, demonstrate that our method consistently outperforms existing reward models.
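As a hedged illustration of the stepwise supervision idea, the sketch below splits a CTE-style query into its named subqueries and scores the fraction of reference steps the prediction reproduces. The regex-based CTE extraction and exact name/body matching are simplifying assumptions for illustration; StepRTM itself is a learned reward model:

```python
import re

def extract_ctes(sql: str) -> dict:
    """Pull `name AS (body)` pairs out of a WITH clause using
    balanced-parenthesis scanning (illustrative, not a full parser)."""
    ctes = {}
    for m in re.finditer(r"(\w+)\s+AS\s*\(", sql, re.IGNORECASE):
        depth, i = 1, m.end()
        while i < len(sql) and depth:
            depth += {"(": 1, ")": -1}.get(sql[i], 0)
            i += 1
        # Normalize whitespace in the subquery body.
        ctes[m.group(1).lower()] = " ".join(sql[m.end():i - 1].split())
    return ctes

def stepwise_reward(pred_sql: str, gold_sql: str) -> float:
    """Intermediate supervision: fraction of reference CTE steps that
    the prediction reproduces by name and normalized body."""
    pred, gold = extract_ctes(pred_sql), extract_ctes(gold_sql)
    if not gold:
        return 0.0
    hits = sum(1 for name, body in gold.items()
               if pred.get(name, "").lower() == body.lower())
    return hits / len(gold)
```

Rewarding each CTE step separately, rather than only the final query, is what lets the model receive credit for partially correct decompositions and nudges it toward the structurally clear, CTE-based style the abstract describes.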