Latent-GRPO: Group Relative Policy Optimization for Latent Reasoning

📅 2026-04-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses three coupled bottlenecks in reinforcement learning for latent reasoning: the absence of an intrinsic latent manifold, so that unconstrained exploration pushes rollouts off the valid manifold; misalignment between exploration and optimization; and non-closure of latent mixtures. Building on the Group Relative Policy Optimization (GRPO) framework, the authors introduce three components: advantage masking for invalid samples, one-sided noise sampling, and selection of the first token of the optimal correct path. Across four low-difficulty and four high-difficulty benchmarks, Latent-GRPO improves Pass@1 by 7.86 points over its latent initialization on low-difficulty tasks and exceeds explicit GRPO by 4.27 points on high-difficulty tasks, while using 3–4× shorter reasoning chains. It also achieves stronger Pass@$k$ performance under Gumbel sampling.
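For readers unfamiliar with the Pass@$k$ metric mentioned above, the sketch below shows the widely used unbiased estimator, 1 - C(n-c, k)/C(n, k), computed from n sampled answers of which c are correct. The source does not say which estimator the paper uses, so this is an assumption for illustration, and the function name `pass_at_k` is hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples, c of which are correct.

    Assumption: this is the standard unbiased estimator,
    1 - C(n - c, k) / C(n, k); the paper's exact protocol is not
    given in this summary.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k incorrect samples: every size-k subset contains a correct one.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k = 1 the estimate reduces to the fraction of correct samples, c / n, which is how a Pass@1 score relates to the same set of rollouts.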

📝 Abstract

Latent reasoning offers a more efficient alternative to explicit reasoning by compressing intermediate reasoning into continuous representations and substantially shortening reasoning chains. However, existing latent reasoning methods mainly focus on supervised learning, and reinforcement learning in latent space remains highly unstable. We study this problem through the lens of Group Relative Policy Optimization (GRPO), and show that directly adapting GRPO to latent reasoning is fundamentally non-trivial: latent reasoning changes both the probability density and the sampling mechanism, causing three coupled bottlenecks: absence of intrinsic latent manifolds, where unconstrained exploration pushes rollouts off the valid latent manifold; exploration-optimization misalignment, where trajectory-level rewards can induce incorrect token-level updates; and latent mixture non-closure, where jointly reinforcing multiple correct latent paths can produce an invalid averaged state. To address them, we propose Latent-GRPO, which combines invalid-sample advantage masking, one-sided noise sampling, and optimal correct-path first-token selection. Across four low-difficulty benchmarks (e.g., GSM8K-Aug) and four high-difficulty benchmarks (e.g., AIME), Latent-GRPO improves over its latent initialization by 7.86 Pass@1 points on low-difficulty tasks and surpasses explicit GRPO by 4.27 points on high-difficulty tasks while using 3–4× shorter reasoning chains. It also achieves stronger pass@$k$ performance under Gumbel sampling. These results establish Latent-GRPO as an effective approach for stable and efficient latent reasoning.
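To make the idea of invalid-sample advantage masking more concrete, here is a minimal sketch of GRPO-style group-relative advantages with a validity mask. The normalization (reward minus group mean, divided by group standard deviation) follows the usual GRPO recipe. Everything specific to masking is an assumption, because the abstract does not give the formula: here invalid rollouts receive zero advantage and are excluded from the group statistics. The names `masked_group_advantages`, `valid`, and `eps` are hypothetical.

```python
from statistics import mean, pstdev


def masked_group_advantages(rewards, valid, eps=1e-6):
    """Group-relative advantages for one prompt's rollouts.

    rewards: one scalar reward per rollout in the group.
    valid:   one bool per rollout; False marks an invalid sample.
    Assumption (not from the source): invalid rollouts get advantage 0
    and do not contribute to the group mean or standard deviation.
    """
    if len(rewards) != len(valid):
        raise ValueError("rewards and valid must have the same length")
    kept = [r for r, ok in zip(rewards, valid) if ok]
    # With fewer than two valid rollouts there is no relative signal.
    if len(kept) < 2:
        return [0.0] * len(rewards)
    mu = mean(kept)
    sigma = pstdev(kept)
    return [
        (r - mu) / (sigma + eps) if ok else 0.0
        for r, ok in zip(rewards, valid)
    ]
```

For example, `masked_group_advantages([1.0, 0.0, 1.0, 0.0], [True, True, False, True])` gives a positive advantage to the first rollout, negative advantages to the second and fourth, and exactly zero to the masked third rollout, so an invalid sample contributes no gradient signal in this sketch.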

Problem

Research questions and friction points this paper is trying to address.

latent reasoning

reinforcement learning

policy optimization

latent manifold

exploration-optimization misalignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

Latent Reasoning

Group Relative Policy Optimization

Reinforcement Learning