Prioritize the Process, Not Just the Outcome: Rewarding Latent Thought Trajectories Improves Reasoning in Looped Language Models

📅 2026-02-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing reinforcement learning approaches reward only the final hidden state of looped language models (LoopLMs), making it difficult to optimize their multi-step implicit reasoning. This work proposes RLTT, a novel framework that, for the first time, enables dense reward assignment over the entire implicit chain-of-thought trajectory, allowing fine-grained training of the reasoning process without reliance on external verifiers. RLTT employs a trajectory-level credit-assignment mechanism and serves as a direct drop-in replacement for GRPO, integrating seamlessly with the Ouro-2.6B-Thinking architecture. Experimental results demonstrate that RLTT achieves substantial accuracy improvements of 14.4%, 16.6%, and 10.0% on the MATH-500, AIME24, and BeyondAIME benchmarks, respectively, while also exhibiting strong transferability to non-mathematical tasks.
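
The paper page does not include pseudocode, so the sketch below is only an illustration of the core contrast it describes: GRPO normalizes a single outcome reward within a group of sampled completions, whereas a trajectory-level scheme in the spirit of RLTT would spread that group-relative advantage across the latent loop steps. The per-step weighting (`traj_scores`) and the function names are hypothetical stand-ins, not the paper's actual credit-assignment rule.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """GRPO-style advantage: normalize one outcome reward per completion
    within its sampling group. rewards has shape (G,)."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def trajectory_advantages(rewards: torch.Tensor,
                          traj_scores: torch.Tensor) -> torch.Tensor:
    """Hypothetical trajectory-level credit assignment: spread each
    completion's group-relative advantage over its T latent loop steps,
    weighted by nonnegative per-step scores. traj_scores has shape (G, T)."""
    adv = grpo_advantages(rewards)                                # (G,)
    weights = traj_scores / traj_scores.sum(dim=1, keepdim=True)  # rows sum to 1
    return adv.unsqueeze(1) * weights                             # (G, T) dense advantages

# Example: 4 sampled completions, 3 latent loop steps each,
# with a uniform per-step weighting as a placeholder.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
scores = torch.ones(4, 3)
print(trajectory_advantages(rewards, scores))
```

Under this (assumed) formulation, every latent step receives a nonzero learning signal, which is the property the summary attributes to RLTT; GRPO by contrast would credit only the final state.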

📝 Abstract
Looped Language Models (LoopLMs) perform multi-step latent reasoning prior to token generation and outperform conventional LLMs on reasoning benchmarks at smaller parameter budgets. However, attempts to further improve LoopLM reasoning with reinforcement learning have failed: standard objectives such as Group Relative Policy Optimization (GRPO) assign credit only to the final latent state, creating a fundamental mismatch with the model's internal computation. To resolve this, we introduce RLTT (Reward Latent Thought Trajectories), a reinforcement learning framework that distributes reward across the full latent reasoning trajectory. RLTT provides dense, trajectory-level credit assignment without relying on external verifiers and can directly replace GRPO with negligible overhead. Across extensive experiments with Ouro-2.6B-Thinking under identical training and inference conditions, RLTT yields substantial improvements over GRPO on challenging mathematical reasoning benchmarks, improving accuracy by +14.4% on MATH-500, +16.6% on AIME24, and +10.0% on BeyondAIME. Despite being trained exclusively on mathematics, RLTT also transfers effectively to non-mathematical reasoning benchmarks, demonstrating the effectiveness of trajectory-level credit assignment for reinforcement learning in LoopLMs.
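
For readers unfamiliar with LoopLMs, the toy module below illustrates what a "latent reasoning trajectory" is: one shared block applied T times yields intermediate states h_1..h_T before any token is emitted, and a final-state-only objective like GRPO touches just h_T. This is a minimal sketch under that reading of the abstract, not Ouro-2.6B-Thinking's actual architecture; the class and parameters are illustrative.

```python
import torch
import torch.nn as nn

class TinyLoopLM(nn.Module):
    """Toy looped block: the same layer is applied n_loops times, producing
    a latent trajectory h_1..h_T before token prediction. GRPO credits only
    h_T; RLTT-style training would reward the whole trajectory."""
    def __init__(self, d_model: int = 64, n_loops: int = 4):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(
            d_model, nhead=4, dim_feedforward=128, batch_first=True)
        self.n_loops = n_loops

    def forward(self, h: torch.Tensor):
        trajectory = []
        for _ in range(self.n_loops):
            h = self.block(h)          # weights shared across loop iterations
            trajectory.append(h)
        return h, trajectory           # final latent state + full trajectory

x = torch.randn(2, 10, 64)            # (batch, seq, d_model)
final_state, latents = TinyLoopLM()(x)
print(len(latents))                   # 4 latent states, one per loop step
```
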
Problem

Research questions and friction points this paper is trying to address.

Looped Language Models
reinforcement learning
credit assignment
latent reasoning
trajectory-level reward
Innovation

Methods, ideas, or system contributions that make the work stand out.

Looped Language Models
Reinforcement Learning
Latent Thought Trajectories
Credit Assignment
Mathematical Reasoning
Jonathan Williams
Department of Computer Science, Princeton University, Princeton, NJ, U.S.A.
Esin Tureci
Senior Researcher, Princeton University