A State-of-the-Art SQL Reasoning Model using RLVR

📅 2025-09-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

For domain-specific natural-language-to-SQL (NL2SQL) tasks in enterprise settings that require organization-specific knowledge, this paper applies reinforcement learning with verifiable rewards (RLVR) to the BIRD benchmark. The training recipe combines careful prompt and model selection, a warm-up stage using the authors' offline RL approach, TAO, and subsequent online RLVR training. It uses no training data beyond the BIRD training set and no proprietary models. The authors' first leaderboard submission reached state-of-the-art accuracy on the private test set: 73.56% without self-consistency and 75.68% with self-consistency, in the latter case with fewer generations than the second-best approach.

📝 Abstract

Developing custom reasoning models via Reinforcement Learning (RL) that can incorporate organization-specific knowledge has great potential to address problems faced by enterprise customers. In many of these problems, the reward function is verifiable, a setting termed RL with Verifiable Rewards (RLVR). We apply RLVR to a popular data science benchmark called BIRD that measures the ability of an AI agent to convert a natural language query for a database to SQL executions. We apply a simple and general-purpose training recipe involving careful prompt and model selection, a warm-up stage using our offline RL approach called TAO, followed by rigorous online RLVR training. With no additional training data beyond the BIRD training set and no use of proprietary models, our very first submission to the BIRD leaderboard reached state-of-the-art accuracy on the private test set: 73.56% without self-consistency and 75.68% with self-consistency. In the latter case, our model also required fewer generations than the second-best approach. While BIRD is only a proxy task, the simplicity of our framework makes it broadly applicable to enterprise domains such as business intelligence, data science, and coding.
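To make the "verifiable reward" idea concrete, here is a minimal sketch of an execution-match reward for text-to-SQL. It is an illustration under stated assumptions, not the authors' implementation: the function names (`execute`, `execution_reward`, `demo_connection`), the SQLite backend, the binary 0/1 reward, and the order-insensitive row comparison are all assumptions that the abstract does not specify.

```python
import sqlite3


def execute(conn, sql):
    """Run a query; return its rows in a canonical order, or None if it fails."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    # Assumption: row order is ignored when comparing results.
    return sorted(rows, key=repr)


def execution_reward(conn, predicted_sql, gold_sql):
    """Hypothetical binary reward: 1.0 if both queries return the same rows."""
    gold = execute(conn, gold_sql)
    pred = execute(conn, predicted_sql)
    if gold is None or pred is None:
        return 0.0
    return 1.0 if pred == gold else 0.0


def demo_connection():
    """Build a tiny in-memory database used only for this illustration."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL);
        INSERT INTO orders VALUES (1, 'EU', 10.0), (2, 'US', 25.0), (3, 'EU', 5.0);
        """
    )
    return conn


if __name__ == "__main__":
    conn = demo_connection()
    gold = "SELECT SUM(amount) FROM orders WHERE region = 'EU'"
    print(execution_reward(conn, "SELECT SUM(amount) FROM orders WHERE region='EU'", gold))
    print(execution_reward(conn, "SELECT SUM(amount) FROM orders", gold))
```

The reward is verifiable because it depends only on executing both queries against the database, not on a learned judge. In practice, untrusted generated SQL should be run against a read-only or disposable copy of the database.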

Problem

Research questions and friction points this paper is trying to address.

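Enterprise customers need custom reasoning models that incorporate organization-specific knowledge

Converting natural language database queries to SQL, as measured by the BIRD benchmark

A simple, general-purpose training recipe is needed that transfers to business intelligence, data science, and coding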

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses RLVR for SQL query translation

Applies TAO offline warm-up with online RL

Reaches 73.56% accuracy (75.68% with self-consistency) on the BIRD private test set using only the BIRD training set and no proprietary models
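The self-consistency result mentioned in the abstract can be illustrated with a common variant of the technique: sample several candidate queries, execute each, and keep a query whose execution result is shared by the most candidates. This is a sketch under assumptions, since the abstract does not describe the authors' exact voting procedure. The names `result_key` and `self_consistent_sql`, the SQLite backend, and the tie-breaking rule (first candidate seen) are hypothetical.

```python
import sqlite3
from collections import Counter


def result_key(conn, sql):
    """Execute a candidate; return a hashable key for its result, or None on error."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    # Assumption: results are compared without regard to row order.
    return repr(sorted(rows, key=repr))


def self_consistent_sql(conn, candidates):
    """Return the first candidate whose execution result wins the majority vote."""
    keyed = [(result_key(conn, sql), sql) for sql in candidates]
    # Candidates that fail to execute do not get a vote.
    votes = Counter(key for key, _ in keyed if key is not None)
    if not votes:
        return None
    winner, _ = votes.most_common(1)[0]
    return next(sql for key, sql in keyed if key == winner)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (x INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])
    samples = [
        "SELECT COUNT(*) FROM t",
        "SELECT MAX(x) FROM t",
        "SELECT MIN(x) FROM t",
        "SELEC broken",
    ]
    print(self_consistent_sql(conn, samples))
```

Each additional candidate costs one more model generation, which is why the number of generations is reported alongside accuracy when self-consistency is used.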
