AI Summary
To address the reliance of large language models (LLMs) on costly online reinforcement learning or ground-truth rewards for improving reasoning capabilities, this paper proposes RoiRL, a computationally efficient, self-supervised, offline iterative reinforcement learning framework. RoiRL eliminates the need for a reference model and instead optimizes the policy via a weighted log-likelihood objective, substantially reducing memory and computational overhead. It further constructs pseudo-rewards through majority voting over multiple model outputs, enabling fully label-free offline training. Experimental results demonstrate that RoiRL accelerates training by up to 2.5× and consistently outperforms TTRL across multiple reasoning benchmarks. By decoupling policy improvement from external reward signals and expensive online interaction, RoiRL establishes a scalable, unsupervised paradigm for LLM self-evolution.
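The two ingredients above (majority-vote pseudo-rewards and a reward-weighted log-likelihood objective) can be sketched in a few lines. This is a minimal illustrative toy, not the paper's implementation: the function names and the 0/1 reward scheme are assumptions, and real training would weight token-level log-probabilities from the LLM rather than scalar placeholders.

```python
from collections import Counter

def majority_vote_rewards(answers):
    """Assign a 0/1 pseudo-reward per sampled answer: 1 if it matches
    the majority-voted answer, 0 otherwise (no ground-truth labels)."""
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]

def weighted_loglik_loss(log_probs, rewards):
    """Offline objective: minimize the negative reward-weighted
    log-likelihood. Note there is no KL term against a reference
    model, which is what saves memory relative to online RL."""
    assert len(log_probs) == len(rewards)
    return -sum(r * lp for r, lp in zip(log_probs, rewards)) / len(log_probs)

# Toy usage: three sampled answers to one prompt; two agree on "42",
# so only those two contribute to the training signal.
answers = ["42", "42", "41"]
rewards = majority_vote_rewards(answers)            # [1.0, 1.0, 0.0]
loss = weighted_loglik_loss([-1.2, -0.8, -2.0], rewards)
```

In an iterative setup, the model would regenerate answers with the updated policy and repeat this weighting on the fresh samples, all offline.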
Abstract
Reinforcement learning (RL) is central to improving reasoning in large language models (LLMs) but typically requires ground-truth rewards. Test-Time Reinforcement Learning (TTRL) removes this need by using majority-vote rewards, but relies on heavy online RL and incurs substantial computational cost. We propose RoiRL: Reasoning with offline iterative Reinforcement Learning, a family of lightweight offline learning alternatives that can target the same regularized optimal policies. Unlike TTRL, RoiRL eliminates the need to maintain a reference model and instead optimizes weighted log-likelihood objectives, enabling stable training with significantly lower memory and compute requirements. Experimental results show that RoiRL trains up to 2.5× faster and consistently outperforms TTRL on reasoning benchmarks, establishing a scalable path to self-improving LLMs without labels.