AI Summary
To address the reliance of large language models (LLMs) on costly online reinforcement learning or ground-truth rewards for improving reasoning capabilities, this paper proposes RoiRL, a computationally efficient, self-supervised, offline iterative reinforcement learning framework. RoiRL eliminates the need for a reference model and instead optimizes the policy via a weighted log-likelihood objective, substantially reducing memory and computational overhead. It further constructs pseudo-rewards through majority voting over multiple model outputs, enabling fully label-free offline training. Experimental results demonstrate that RoiRL accelerates training by up to 2.5× and consistently outperforms TTRL across multiple reasoning benchmarks. By decoupling policy improvement from external reward signals and expensive online interaction, RoiRL establishes a scalable, unsupervised paradigm for LLM self-evolution.
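The two ingredients above (majority-vote pseudo-rewards and a reward-weighted log-likelihood objective) can be sketched in a few lines. This is a minimal illustrative toy, not the paper's implementation: the function names and the 0/1 reward scheme are assumptions, and real training would weight token-level log-probabilities from the LLM rather than scalar placeholders.

```python
from collections import Counter

def majority_vote_rewards(answers):
    """Assign a 0/1 pseudo-reward per sampled answer: 1 if it matches
    the majority-voted answer, 0 otherwise (no ground-truth labels)."""
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]

def weighted_loglik_loss(log_probs, rewards):
    """Offline objective: minimize the negative reward-weighted
    log-likelihood. Note there is no KL term against a reference
    model, which is what saves memory relative to online RL."""
    assert len(log_probs) == len(rewards)
    return -sum(r * lp for r, lp in zip(log_probs, rewards)) / len(log_probs)

# Toy usage: three sampled answers to one prompt; two agree on "42",
# so only those two contribute to the training signal.
answers = ["42", "42", "41"]
rewards = majority_vote_rewards(answers)            # [1.0, 1.0, 0.0]
loss = weighted_loglik_loss([-1.2, -0.8, -2.0], rewards)
```

In an iterative setup, the model would regenerate answers with the updated policy and repeat this weighting on the fresh samples, all offline.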
Abstract
Reinforcement learning (RL) is central to improving reasoning in large language models (LLMs) but typically requires ground-truth rewards. Test-Time Reinforcement Learning (TTRL) removes this need by using majority-vote rewards, but relies on heavy online RL and incurs substantial computational cost. We propose RoiRL: Reasoning with offline iterative Reinforcement Learning, a family of lightweight offline learning alternatives that can target the same regularized optimal policies. Unlike TTRL, RoiRL eliminates the need to maintain a reference model and instead optimizes weighted log-likelihood objectives, enabling stable training with significantly lower memory and compute requirements. Experimental results show that RoiRL trains up to 2.5× faster and consistently outperforms TTRL on reasoning benchmarks, establishing a scalable path to self-improving LLMs without labels.