🤖 AI Summary
This work addresses the challenge of training large language models via reinforcement learning in long-context scenarios, where reliance on sparse rewards derived solely from final answers leads to vanishing gradients during context localization. The authors formally prove, for the first time, that outcome-only rewards inherently suffer from gradient vanishing in this setting. To overcome this limitation, they propose a verifiable dense contextual reward mechanism that introduces auxiliary rewards to directly incentivize the selection of correct evidence. Built on this mechanism, their framework, LongRLVR, achieves substantial performance gains on benchmarks such as RULER-QA and LongBench v2, improving a 14B-parameter model's RULER-QA score from 73.17 to 88.90.
📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capabilities of Large Language Models (LLMs) by optimizing them against factual outcomes. However, this paradigm falters in long-context scenarios, as its reliance on internal parametric knowledge is ill-suited for tasks requiring contextual grounding: the ability to find and reason over externally provided information. We identify a key reason for this failure: a reward based solely on the final answer is too sparse to effectively guide the model in identifying relevant evidence. We formally prove that the outcome-only reward leads to significant vanishing gradients for the context grounding process, rendering learning intractable. To overcome this bottleneck, we introduce LongRLVR, which augments the sparse answer reward with a dense and verifiable context reward. This auxiliary signal directly incentivizes the model to select the correct grounding information, providing a robust learning gradient that solves the underlying optimization challenge. We validate our method on challenging long-context benchmarks using Qwen and LLaMA models. LongRLVR consistently and significantly outperforms standard RLVR across all models and benchmarks, e.g., boosting a 14B model's score on RULER-QA from 73.17 to 88.90 and on LongBench v2 from 39.8 to 46.5. Our work demonstrates that explicitly rewarding the grounding process is a critical and effective strategy for unlocking the full reasoning potential of LLMs in long-context applications. Our code is available at https://github.com/real-absolute-AI/LongRLVR.
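The core idea of augmenting a sparse answer reward with a dense, verifiable context reward can be illustrated with a minimal sketch. This is an illustrative assumption, not the paper's exact formulation: here the context reward is modeled as the F1 overlap between the evidence chunks the model cites and the gold evidence, and `alpha` is a hypothetical weighting coefficient.

```python
def answer_reward(pred: str, gold: str) -> float:
    """Sparse outcome reward: 1.0 only for an exact-match final answer."""
    return 1.0 if pred.strip().lower() == gold.strip().lower() else 0.0

def context_reward(cited_ids: set, gold_ids: set) -> float:
    """Dense, verifiable reward: F1 between cited and gold evidence chunks.

    Because gold evidence is known, this signal can be checked automatically,
    which is what makes it "verifiable" in the RLVR sense.
    """
    if not cited_ids or not gold_ids:
        return 0.0
    tp = len(cited_ids & gold_ids)
    precision = tp / len(cited_ids)
    recall = tp / len(gold_ids)
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def total_reward(pred: str, gold: str,
                 cited_ids: set, gold_ids: set,
                 alpha: float = 0.5) -> float:
    """Augment the sparse answer reward with the dense context reward."""
    return answer_reward(pred, gold) + alpha * context_reward(cited_ids, gold_ids)

# Even when the final answer is wrong (answer reward = 0), the model still
# receives a nonzero gradient signal for grounding on the right evidence:
r_wrong = total_reward("Pairs", "Paris", {"c3", "c7"}, {"c3", "c9"})
r_right = total_reward("Paris", "Paris", {"c3", "c9"}, {"c3", "c9"})
```

Under an outcome-only reward, `r_wrong` would be exactly zero and provide no learning signal about which evidence to attend to; the dense term keeps it positive whenever the cited evidence overlaps the gold evidence.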