From Verifiable Dot to Reward Chain: Harnessing Verifiable Reference-based Rewards for Reinforcement Learning of Open-ended Generation

πŸ“… 2026-01-26
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the inefficiency and reward-hacking vulnerability of existing verifiable-reward reinforcement learning approaches when applied to open-ended generation, where unambiguous ground-truth answers are unavailable. The authors propose RLVRR, a method that extends verifiable rewards from single-point answer checks to a "reward chain" derived from high-quality reference texts. The framework establishes a dual-track verifiable reward mechanism that evaluates both content fidelity (via keyword retention) and stylistic quality (via LLM-based validation). By combining the exploratory strength of reinforcement learning with the reliability of supervised fine-tuning, RLVRR unifies structured reasoning and open-ended generation in a single training paradigm. Experiments on more than ten benchmarks show that RLVRR significantly outperforms supervised fine-tuning trained on ten times more data, as well as state-of-the-art reward models, consistently improving generation quality, generalization, and output diversity.

πŸ“ Abstract
Reinforcement learning with verifiable rewards (RLVR) succeeds in reasoning tasks (e.g., math and code) by checking the final verifiable answer (i.e., a verifiable dot signal). However, extending this paradigm to open-ended generation is challenging because there is no unambiguous ground truth. Relying on single-dot supervision often leads to inefficiency and reward hacking. To address these issues, we propose reinforcement learning with verifiable reference-based rewards (RLVRR). Instead of checking the final answer, RLVRR extracts an ordered linguistic signal from high-quality references (i.e., a reward chain). Specifically, RLVRR decomposes rewards into two dimensions: content, which preserves deterministic core concepts (e.g., keywords), and style, which evaluates adherence to stylistic properties through LLM-based verification. In this way, RLVRR combines the exploratory strength of RL with the efficiency and reliability of supervised fine-tuning (SFT). Extensive experiments on more than 10 benchmarks with Qwen and Llama models confirm the advantages of our approach. RLVRR (1) substantially outperforms SFT trained with ten times more data and advanced reward models, (2) unifies the training of structured reasoning and open-ended generation, and (3) generalizes more effectively while preserving output diversity. These results establish RLVRR as a principled and efficient path toward verifiable reinforcement learning for general-purpose LLM alignment. We release our code and data at https://github.com/YJiangcm/RLVRR.
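The dual-track reward described in the abstract can be illustrated with a minimal sketch. The function names, the simple substring keyword check, and the weighting parameter `alpha` below are illustrative assumptions, not the paper's actual implementation; the real method derives an ordered reward chain from references and uses an LLM verifier for the style track, which is stubbed here as a precomputed score.

```python
def content_reward(response: str, keywords: list[str]) -> float:
    """Content track: fraction of reference-derived keywords retained
    in the response (hypothetical substring matching for illustration)."""
    if not keywords:
        return 0.0
    text = response.lower()
    hits = sum(1 for kw in keywords if kw.lower() in text)
    return hits / len(keywords)


def dual_track_reward(response: str, keywords: list[str],
                      style_score: float, alpha: float = 0.5) -> float:
    """Combine content fidelity with an LLM-judged style score (both in
    [0, 1]); alpha is an assumed mixing weight, not from the paper."""
    return alpha * content_reward(response, keywords) + (1 - alpha) * style_score
```

In an RL loop, `dual_track_reward` would score each sampled generation against keywords extracted from its reference text, with `style_score` supplied by an LLM verifier.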
Problem

Research questions and friction points this paper is trying to address.

reinforcement learning
open-ended generation
verifiable rewards
reward hacking
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

reward chain
verifiable reference-based rewards
reinforcement learning
open-ended generation
LLM alignment
πŸ”Ž Similar Papers
No similar papers found.