On the Direction of RLVR Updates for LLM Reasoning: Identification and Exploitation

📅 2026-03-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a limitation of existing analyses of reinforcement learning with verifiable rewards (RLVR): they emphasize the magnitude of parameter updates while neglecting the role of update direction in improving large language models' reasoning. The authors treat update direction as a central analytical dimension, characterizing it via token-level signed log-probability differences (Δlog p) between the base and RLVR-trained models. They propose two strategies, test-time extrapolation and training-time reweighting, to identify and amplify the sparse updates most critical for reasoning. Through statistical analysis, token-substitution interventions, and a verifiable-reward framework, they demonstrate consistent gains across multiple models and reasoning benchmarks, establishing that update direction is a more effective signal than magnitude for guiding RLVR optimization.

📝 Abstract
Reinforcement learning with verifiable rewards (RLVR) has substantially improved the reasoning capabilities of large language models. While existing analyses identify that RLVR-induced changes are sparse, they primarily focus on the **magnitude** of these updates, largely overlooking their **direction**. In this work, we argue that the direction of updates is a more critical lens for understanding RLVR's effects, which can be captured by the signed, token-level log-probability difference Δlog p between the base and final RLVR models. Through statistical analysis and token-replacement interventions, we demonstrate that Δlog p more effectively identifies sparse, yet reasoning-critical updates than magnitude-based metrics (e.g., divergence or entropy). Building on this insight, we propose two practical applications: (1) a *test-time extrapolation* method that amplifies the policy along the learned Δlog p direction to improve reasoning accuracy without further training; (2) a *training-time reweighting* method that focuses learning on low-probability (corresponding to higher Δlog p) tokens, which improves reasoning performance across models and benchmarks. Our work establishes the direction of change as a key principle for analyzing and improving RLVR.
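
The test-time extrapolation idea can be illustrated with a minimal sketch. Assume we have next-token log-probabilities from the base and RLVR models over a shared vocabulary; the signed difference Δlog p defines an update direction, and stepping further along it (a scaling factor `alpha` > 1) amplifies the RLVR change. The function name, `alpha` parameter, and toy distributions below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def extrapolate_logprobs(logp_base, logp_rlvr, alpha=1.5):
    # Delta log p: signed, token-level log-probability difference (base -> RLVR)
    delta = logp_rlvr - logp_base
    # Step along the update direction; alpha = 1 recovers the RLVR model exactly
    scores = logp_base + alpha * delta
    # Renormalize so the result is a valid log-distribution over the vocabulary
    return scores - np.log(np.exp(scores).sum())

# Toy vocabulary of 4 tokens: RLVR shifted mass from token 0 to token 1
logp_base = np.log(np.array([0.40, 0.30, 0.20, 0.10]))
logp_rlvr = np.log(np.array([0.25, 0.45, 0.20, 0.10]))
extrapolated = extrapolate_logprobs(logp_base, logp_rlvr, alpha=2.0)
# With alpha = 2, token 1 (the one RLVR promoted) gains even more probability
```

With `alpha = 1` this reduces to the RLVR policy; larger values push further in the learned direction without any additional training, which is the appeal of the test-time variant.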
Problem

Research questions and friction points this paper is trying to address.

RLVR
update direction
large language models
reasoning
reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

RLVR
update direction
Δlog p
reasoning enhancement
token-level analysis