🤖 AI Summary
Neural text-to-SQL generation heavily relies on high-quality, manually annotated SQL queries—a major bottleneck in low-resource settings. Method: This paper proposes an execution-guided reinforcement learning (RL) framework that leverages question-answer pairs as weak supervision, eliminating dependence on gold-standard SQL annotations. It reformulates SQL generation as an execution-driven RL task and applies Group Relative Policy Optimization (GRPO), a policy optimization method well suited to stable training under sparse, database-execution-based rewards. Contribution/Results: The approach significantly enhances symbolic reasoning and generalization capabilities. On the Spider benchmark, it improves SQL execution accuracy from 31.49% to 49.83% and reduces the error rate from 25.43% to 14.71%. Remarkably, its performance approaches that of SQLCoder-70B—a model with an order-of-magnitude larger parameter count—establishing a new paradigm for resource-efficient, execution-aware text-to-SQL generation.
📝 Abstract
In this work, we study the problem of code generation with a large language model (LLM), with a focus on generating SQL queries from natural language questions. We ask: instead of using supervised fine-tuning with text-code pairs, can we tune a model by having it interact with a database engine? We frame this as a reinforcement learning problem in which the model receives execution-based feedback from the environment in the form of scalar rewards. These rewards penalize execution failures and assign positive values when a query returns a correct answer. We use the rewards within the Group Relative Policy Optimization (GRPO) framework, and we test and evaluate our findings on a tabular reasoning benchmark. We find that with only weak supervision in the form of question-answer pairs, RL-tuning improves the accuracy of model-generated SQL code from 31.49% to 49.83% while reducing the error percentage from 25.43% to 14.71%. This improvement allowed the model to nearly match the performance of the much larger SQLCoder-70B model. Our work demonstrates the potential of using execution-based feedback to improve the symbolic reasoning capabilities of LLMs.