Not All Steps are Informative: On the Linearity of LLMs' RLVR Training

📅 2026-01-08
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high computational cost of reinforcement learning with verifiable rewards (RLVR), which stems primarily from prolonged exploration. The study is the first to identify a strong linear trend in both model weights and output log-probabilities over the course of RLVR training. Leveraging this observation, the authors propose an extrapolation-based training strategy that predicts future model states by linearly extrapolating from intermediate checkpoints, avoiding continued expensive training. Two variants are introduced: weight extrapolation achieves performance comparable to standard RL training while substantially reducing computational overhead, and logits extrapolation consistently outperforms continued training across four benchmarks. These findings challenge conventional RLVR training paradigms and offer a more efficient alternative grounded in the discovered linear dynamics.
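The weight-extrapolation idea can be sketched in a few lines. This is an illustrative reconstruction, not the paper's code: the function name, the checkpoint format (plain dicts of NumPy arrays), and the specific step numbers are all assumptions; the paper's released implementation is at the repository linked below.

```python
import numpy as np

def extrapolate_weights(theta_a, theta_b, step_a, step_b, step_target):
    """Predict the weights at step_target by linearly extrapolating each
    parameter along the line through two intermediate checkpoints."""
    alpha = (step_target - step_a) / (step_b - step_a)
    return {name: theta_a[name] + alpha * (theta_b[name] - theta_a[name])
            for name in theta_a}

# Hypothetical checkpoints at RL steps 100 and 200; predict step 400.
ckpt_a = {"w": np.array([1.0, 2.0])}
ckpt_b = {"w": np.array([2.0, 4.0])}
theta_400 = extrapolate_weights(ckpt_a, ckpt_b, 100, 200, 400)
print(theta_400["w"])  # alpha = 3, so [4.0, 8.0]
```

The point of the method is that this single vector operation replaces the RL training steps between the second checkpoint and the target step.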

📝 Abstract
Reinforcement learning with verifiable rewards (RLVR) has become a central component of large language model (LLM) post-training. Unlike supervised fine-tuning (SFT), RLVR lets an LLM generate multiple candidate solutions and reinforces those that lead to a verifiably correct final answer. However, in practice, RLVR often requires thousands of training steps to reach strong performance, incurring substantial computation largely attributed to prolonged exploration. In this work, we make a surprising observation: during RLVR, LLMs evolve in a strongly linear manner. Specifically, both model weights and model output log-probabilities exhibit strong linear correlations with RL training steps. This suggests that RLVR predominantly amplifies trends that emerge early in training, rather than continuously discovering new behaviors throughout the entire optimization trajectory. Motivated by this linearity, we investigate whether future model states can be predicted from intermediate checkpoints via extrapolation, avoiding continued expensive training. We show that Weight Extrapolation produces models with performance comparable to standard RL training while requiring significantly less computation. Moreover, Logits Extrapolation consistently outperforms continued RL training on mathematics and code benchmarks by extrapolating beyond the step range where RL training remains stable. Our code is available at https://github.com/Miaow-Lab/RLVR-Linearity
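Logits extrapolation operates in output space rather than weight space: per-token log-probabilities from two checkpoints are extrapolated linearly and renormalized into a valid distribution. The sketch below is an assumed formulation for illustration (the function name, the extrapolation factor `alpha`, and the toy vocabulary are not from the paper).

```python
import numpy as np

def extrapolate_logprobs(logp_early, logp_late, alpha):
    """Extrapolate next-token log-probabilities beyond the later
    checkpoint (alpha > 1), then renormalize to a distribution."""
    z = logp_early + alpha * (logp_late - logp_early)
    z = z - z.max()          # subtract max for numerical stability
    p = np.exp(z)
    return np.log(p / p.sum())

# Toy 3-token vocabulary: RL sharpens the distribution between the two
# checkpoints, and extrapolating with alpha = 2 sharpens it further.
logp_early = np.log(np.array([0.5, 0.3, 0.2]))
logp_late  = np.log(np.array([0.7, 0.2, 0.1]))
print(np.exp(extrapolate_logprobs(logp_early, logp_late, 2.0)))
```

Because extrapolation happens at inference time, this variant can reach "step counts" past the range where RL training itself remains stable, which is how the abstract frames its gains over continued training.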
Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning with Verifiable Rewards
Large Language Models
Training Efficiency
Computational Cost
Post-training
Innovation

Methods, ideas, or system contributions that make the work stand out.

linearity
weight extrapolation
logits extrapolation
RLVR
efficient training
Tianle Wang
Brookhaven National Lab
High performance computation
Zhongyuan Wu
Li Auto Inc.; Beihang University
Shenghao Jin
Li Auto Inc.; Beihang University
Hao Xu
Li Auto Inc.
Wei Chen
Li Auto Inc.
Ning Miao
Department of Data Science, City University of Hong Kong; Hong Kong Institute of AI for Science, City University of Hong Kong; Li Auto Inc.