Prioritize the Process, Not Just the Outcome: Rewarding Latent Thought Trajectories Improves Reasoning in Looped Language Models

📅 2026-02-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing reinforcement learning approaches reward only the final hidden state of looped language models (LoopLMs), making it difficult to optimize their multi-step implicit reasoning. This work proposes RLTT, a novel framework that, for the first time, enables dense reward assignment over the entire implicit chain-of-thought trajectory, allowing fine-grained training of the reasoning process without reliance on external verifiers. RLTT employs a trajectory-level credit-assignment mechanism and serves as a direct drop-in replacement for GRPO, integrating seamlessly with the Ouro-2.6B-Thinking architecture. Experimental results demonstrate that RLTT achieves substantial accuracy improvements of 14.4%, 16.6%, and 10.0% on the MATH-500, AIME24, and BeyondAIME benchmarks, respectively, while also exhibiting strong transferability to non-mathematical tasks.
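
The paper page does not include pseudocode, so the sketch below is only an illustration of the core contrast it describes: GRPO normalizes a single outcome reward within a group of sampled completions, whereas a trajectory-level scheme in the spirit of RLTT would spread that group-relative advantage across the latent loop steps. The per-step weighting (`traj_scores`) and the function names are hypothetical stand-ins, not the paper's actual credit-assignment rule.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """GRPO-style advantage: normalize one outcome reward per completion
    within its sampling group. rewards has shape (G,)."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def trajectory_advantages(rewards: torch.Tensor,
                          traj_scores: torch.Tensor) -> torch.Tensor:
    """Hypothetical trajectory-level credit assignment: spread each
    completion's group-relative advantage over its T latent loop steps,
    weighted by nonnegative per-step scores. traj_scores has shape (G, T)."""
    adv = grpo_advantages(rewards)                                # (G,)
    weights = traj_scores / traj_scores.sum(dim=1, keepdim=True)  # rows sum to 1
    return adv.unsqueeze(1) * weights                             # (G, T) dense advantages

# Example: 4 sampled completions, 3 latent loop steps each,
# with a uniform per-step weighting as a placeholder.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
scores = torch.ones(4, 3)
print(trajectory_advantages(rewards, scores))
```

Under this (assumed) formulation, every latent step receives a nonzero learning signal, which is the property the summary attributes to RLTT; GRPO by contrast would credit only the final state.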

📝 Abstract
Looped Language Models (LoopLMs) perform multi-step latent reasoning prior to token generation and outperform conventional LLMs on reasoning benchmarks at smaller parameter budgets. However, attempts to further improve LoopLM reasoning with reinforcement learning have failed: standard objectives such as Group Relative Policy Optimization (GRPO) assign credit only to the final latent state, creating a fundamental mismatch with the model's internal computation. To resolve this, we introduce RLTT (Reward Latent Thought Trajectories), a reinforcement learning framework that distributes reward across the full latent reasoning trajectory. RLTT provides dense, trajectory-level credit assignment without relying on external verifiers and can directly replace GRPO with negligible overhead. Across extensive experiments with Ouro-2.6B-Thinking under identical training and inference conditions, RLTT yields substantial improvements over GRPO on challenging mathematical reasoning benchmarks, improving accuracy by +14.4% on MATH-500, +16.6% on AIME24, and +10.0% on BeyondAIME. Despite being trained exclusively on mathematics, RLTT also transfers effectively to non-mathematical reasoning benchmarks, demonstrating the effectiveness of trajectory-level credit assignment for reinforcement learning in LoopLMs.
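
For readers unfamiliar with LoopLMs, the toy module below illustrates what a "latent reasoning trajectory" is: one shared block applied T times yields intermediate states h_1..h_T before any token is emitted, and a final-state-only objective like GRPO touches just h_T. This is a minimal sketch under that reading of the abstract, not Ouro-2.6B-Thinking's actual architecture; the class and parameters are illustrative.

```python
import torch
import torch.nn as nn

class TinyLoopLM(nn.Module):
    """Toy looped block: the same layer is applied n_loops times, producing
    a latent trajectory h_1..h_T before token prediction. GRPO credits only
    h_T; RLTT-style training would reward the whole trajectory."""
    def __init__(self, d_model: int = 64, n_loops: int = 4):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(
            d_model, nhead=4, dim_feedforward=128, batch_first=True)
        self.n_loops = n_loops

    def forward(self, h: torch.Tensor):
        trajectory = []
        for _ in range(self.n_loops):
            h = self.block(h)          # weights shared across loop iterations
            trajectory.append(h)
        return h, trajectory           # final latent state + full trajectory

x = torch.randn(2, 10, 64)            # (batch, seq, d_model)
final_state, latents = TinyLoopLM()(x)
print(len(latents))                   # 4 latent states, one per loop step
```
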
Problem

Research questions and friction points this paper is trying to address.

Looped Language Models
reinforcement learning
credit assignment
latent reasoning
trajectory-level reward
Innovation

Methods, ideas, or system contributions that make the work stand out.

Looped Language Models
Reinforcement Learning
Latent Thought Trajectories
Credit Assignment
Mathematical Reasoning
Jonathan Williams
Department of Computer Science, Princeton University, Princeton, NJ, U.S.A.
Esin Tureci
Senior Researcher, Princeton University