🤖 AI Summary
Existing reinforcement learning approaches optimize only output tokens, making them ill-suited for looped language models (LoopLMs), such as those built on the Ouro architecture, whose reasoning unfolds through implicit iterative computation. This work proposes LoopRPT, the first framework to bring reinforcement pre-training to LoopLMs. By leveraging an EMA teacher model and noisy hidden-state rollouts, LoopRPT propagates reinforcement signals directly to intermediate hidden states, enabling end-to-end optimization of the entire latent reasoning process. This overcomes the traditional limitation of confining reinforcement learning to the output layer and markedly improves per-step representation quality across multiple model scales. LoopRPT achieves a Pareto advantage in the accuracy-computation trade-off, showing notably stronger early-stage reasoning, especially on hard tokens.
📝 Abstract
Looped language models (LoopLMs) perform iterative latent computation to refine internal representations, offering a promising alternative to explicit chain-of-thought (CoT) reasoning. However, existing reinforcement learning (RL) paradigms primarily target output tokens, creating a structural mismatch with looped architectures whose reasoning unfolds implicitly. In this work, we propose LoopRPT, a reinforcement pre-training framework tailored for LoopLMs. By reframing next-token prediction as a next-token reasoning task, LoopRPT assigns reinforcement signals directly to latent steps using an EMA teacher reference and noisy latent rollouts. This formulation enables RL to directly shape intermediate representations, compressing effective reasoning into fewer iterations. We instantiate LoopRPT on the Ouro architecture across multiple model scales. Results demonstrate that LoopRPT consistently improves per-step representation quality, achieving Pareto dominance in accuracy-computation trade-offs. Notably, significant gains on hard tokens indicate that LoopRPT enhances early-stage reasoning rather than merely encouraging premature exits. Our findings highlight reinforcement pre-training as a principled paradigm for learning efficient latent reasoning in LoopLMs.
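To make the mechanism concrete, the core ideas of the abstract (an EMA teacher providing a clean reference trajectory, noisy latent rollouts, and per-step reinforcement signals on intermediate hidden states) can be sketched in a toy form. This is a minimal illustrative sketch, not the paper's implementation: the recurrent "loop" is a single weight matrix applied repeatedly, and names such as `loop_forward`, `W_student`, and the distance-based reward are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

DIM, STEPS, TAU, NOISE = 8, 4, 0.99, 0.1

# Toy "looped" student: one weight matrix applied recurrently to the hidden state.
W_student = rng.normal(scale=0.1, size=(DIM, DIM))
# EMA teacher starts as a copy of the student (assumed initialization).
W_teacher = W_student.copy()

def loop_forward(W, h, noise_scale=0.0):
    """Run the recurrent refinement loop, optionally perturbing each latent step."""
    states = []
    for _ in range(STEPS):
        h = np.tanh(W @ h)
        if noise_scale > 0:
            # Noisy latent rollout: perturb the hidden state to explore.
            h = h + rng.normal(scale=noise_scale, size=h.shape)
        states.append(h)
    return states

def ema_update(W_t, W_s, tau=TAU):
    """Exponential-moving-average teacher update: slow copy of the student."""
    return tau * W_t + (1 - tau) * W_s

x = rng.normal(size=DIM)
teacher_states = loop_forward(W_teacher, x)         # clean reference trajectory
student_states = loop_forward(W_student, x, NOISE)  # noisy exploratory rollout

# Per-step reinforcement signal: reward each noisy latent state for staying
# close to the teacher's reference state, so credit reaches intermediate steps
# directly instead of only the final output tokens.
rewards = [-np.linalg.norm(s - t) for s, t in zip(student_states, teacher_states)]

W_teacher = ema_update(W_teacher, W_student)
print([round(r, 3) for r in rewards])
```

In a full training loop these per-step rewards would weight a policy-gradient update of the student, and the EMA teacher would track the student across training; here the sketch only shows how a reward can be assigned to every latent iteration rather than to the output alone.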