AI Summary
Reinforcement learning with verifiable rewards (RLVR) often induces entropy collapse in large reasoning models (LRMs), leading to premature policy convergence and performance saturation. To address this, we propose ReLaX, a novel exploration-regulation mechanism that, for the first time, introduces latent-variable dynamics modeling into reasoning-policy optimization. ReLaX leverages Koopman operator theory to linearize the model's implicit state evolution and introduces a dynamic spectral dispersion (DSD) metric as a regularization term, explicitly constraining entropy decay during training to achieve an adaptive balance between exploration and exploitation. Empirically, ReLaX significantly alleviates convergence stagnation across both multimodal and text-only reasoning tasks, consistently achieving state-of-the-art performance on multiple benchmarks, including MMLU, GSM8K, and MMMU. This work establishes a new paradigm for robust RL-based training of LRMs, advancing the principled integration of dynamical systems theory and reinforcement learning for reasoning optimization.
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated remarkable potential in enhancing the reasoning capability of Large Reasoning Models (LRMs). However, RLVR often leads to entropy collapse, resulting in premature policy convergence and performance saturation. While manipulating token-level entropy has proven effective for promoting policy exploration, we argue that the latent dynamics underlying token generation encode a far richer computational structure for steering policy optimization toward a more effective exploration-exploitation tradeoff. To enable tractable analysis of, and intervention in, the latent dynamics of LRMs, we leverage Koopman operator theory to obtain a linearized representation of their hidden-state dynamics. This enables us to introduce Dynamic Spectral Dispersion (DSD), a new metric that quantifies the heterogeneity of the model's latent dynamics and serves as a direct indicator of policy exploration. Building upon these foundations, we propose Reasoning with Latent eXploration (ReLaX), a paradigm that explicitly incorporates latent dynamics to regulate exploration and exploitation during policy optimization. Comprehensive experiments across a wide range of multimodal and text-only reasoning benchmarks show that ReLaX significantly mitigates premature convergence and consistently achieves state-of-the-art performance.
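The abstract does not give the exact formulas for the Koopman estimate or for DSD, but the overall recipe it describes can be sketched as follows. In this illustrative proxy, the Koopman operator is fitted by DMD-style least squares on consecutive hidden states, and "spectral dispersion" is taken to be the standard deviation of the Koopman eigenvalue moduli; the function names `estimate_koopman` and `dynamic_spectral_dispersion`, and the specific dispersion statistic, are assumptions, not the paper's definitions.

```python
import numpy as np

def estimate_koopman(H: np.ndarray) -> np.ndarray:
    """DMD-style least-squares estimate of a linear operator K with
    h_{t+1} ~= K h_t, fitted over a trajectory H of shape (T, d)."""
    X = H[:-1].T  # columns are states h_0 .. h_{T-2}
    Y = H[1:].T   # columns are states h_1 .. h_{T-1}
    return Y @ np.linalg.pinv(X)  # minimizes ||K X - Y||_F

def dynamic_spectral_dispersion(K: np.ndarray) -> float:
    """Hypothetical DSD proxy: dispersion (std) of the Koopman
    eigenvalue moduli. Higher values indicate more heterogeneous
    latent dynamics, i.e. more exploratory behavior."""
    moduli = np.abs(np.linalg.eigvals(K))
    return float(np.std(moduli))

# Toy hidden-state trajectory standing in for an LRM's hidden states.
rng = np.random.default_rng(0)
H = rng.standard_normal((32, 8))  # 32 steps, 8-dim hidden states
K = estimate_koopman(H)
dsd = dynamic_spectral_dispersion(K)
```

In a training loop, a term like `-lambda * dsd` could then be added to the RLVR loss so that gradient descent is discouraged from collapsing the spectrum; the actual regularizer in ReLaX may differ.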