RoiRL: Efficient, Self-Supervised Reasoning with Offline Iterative Reinforcement Learning

πŸ“… 2025-10-03
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
To address the reliance of large language models (LLMs) on costly online reinforcement learning or ground-truth rewards for improving reasoning capabilities, this paper proposes RoiRLβ€”a computationally efficient, self-supervised, offline iterative reinforcement learning framework. RoiRL eliminates the need for a reference model and instead optimizes the policy via a weighted log-likelihood objective, substantially reducing memory and computational overhead. It further constructs pseudo-rewards through majority voting over multiple model outputs, enabling fully label-free offline training. Experimental results demonstrate that RoiRL accelerates training by 2.5Γ— and consistently outperforms TTRL across multiple reasoning benchmarks. By decoupling policy improvement from external reward signals and expensive online interaction, RoiRL establishes a scalable, unsupervised paradigm for LLM self-evolution.

πŸ“ Abstract
Reinforcement learning (RL) is central to improving reasoning in large language models (LLMs) but typically requires ground-truth rewards. Test-Time Reinforcement Learning (TTRL) removes this need by using majority-vote rewards, but relies on heavy online RL and incurs substantial computational cost. We propose RoiRL: Reasoning with offline iterative Reinforcement Learning, a family of lightweight offline learning alternatives that can target the same regularized optimal policies. Unlike TTRL, RoiRL eliminates the need to maintain a reference model and instead optimizes weighted log-likelihood objectives, enabling stable training with significantly lower memory and compute requirements. Experimental results show that RoiRL trains up to 2.5x faster and consistently outperforms TTRL on reasoning benchmarks, establishing a scalable path to self-improving LLMs without labels.
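The label-free reward described above can be illustrated with a minimal sketch. This is a hypothetical reconstruction, not the paper's code: for each question, multiple completions are sampled, the most common final answer is taken as the consensus, and each completion is scored by agreement with it.

```python
from collections import Counter

def majority_vote_rewards(answers):
    """Pseudo-rewards via majority voting over sampled answers (illustrative).

    A completion whose final answer matches the most common answer gets
    reward 1.0, all others get 0.0 -- no ground-truth label is needed.
    """
    consensus, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == consensus else 0.0 for a in answers], consensus

# Example: final answers extracted from 5 sampled completions
rewards, consensus = majority_vote_rewards(["42", "42", "17", "42", "13"])
# consensus == "42"; rewards == [1.0, 1.0, 0.0, 1.0, 0.0]
```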
Problem

Research questions and friction points this paper is trying to address.

Eliminates need for ground-truth rewards in language model reasoning
Reduces computational costs of reinforcement learning for reasoning
Enables stable self-supervised improvement without reference models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Offline iterative reinforcement learning for reasoning
Eliminates reference model with weighted log-likelihood optimization
Achieves faster training with lower computational requirements
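The reference-model-free objective in the second bullet can be sketched as a reward-weighted log-likelihood loss. This is an assumed simplification of the paper's family of objectives: rather than a KL-regularized online RL loss that requires a frozen reference model, the policy is updated offline to maximize the log-likelihood of its own sampled completions, weighted by their pseudo-rewards.

```python
def weighted_log_likelihood_loss(sample_logprobs, rewards):
    """Reward-weighted log-likelihood objective (illustrative sketch).

    L(theta) = -(1/N) * sum_i w_i * log p_theta(y_i | x),
    where w_i is the pseudo-reward of sampled completion y_i and
    sample_logprobs[i] is its total log-probability under the policy.
    No reference model or KL term is maintained.
    """
    n = len(rewards)
    return -sum(w * lp for w, lp in zip(rewards, sample_logprobs)) / n

# Two sampled completions: one rewarded, one not
loss = weighted_log_likelihood_loss([-1.0, -2.0], [1.0, 0.0])
# loss == 0.5: only the rewarded completion contributes
```

Minimizing this loss raises the likelihood of majority-consistent completions, which is what allows iteration without online rollouts during the update step.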
πŸ”Ž Similar Papers
2023-06-06 · International Conference on Learning Representations · Citations: 4