🤖 AI Summary
This work addresses two limitations of existing visual chain-of-thought reasoning methods: semantic collapse caused by discretizing visual signals, and the rigid bottleneck imposed by reliance on external tools. Latent reasoning methods avoid both problems, but the hybrid discrete-continuous action space they create is hard to optimize. The authors propose HyLaR, a framework that interleaves discrete text generation with continuous visual latent representations. After a cold-start supervised fine-tuning stage, HyLaR is trained with the Decoupled Policy Optimization (DePO) algorithm, which decomposes the policy gradient objective and applies independent trust-region constraints to the textual and latent components. An exact closed-form von Mises-Fisher (vMF) KL regularizer is added to further stabilize training. In experiments, HyLaR outperforms standard multimodal large language models and state-of-the-art latent reasoning approaches on benchmarks covering fine-grained perception and general multimodal understanding.
📝 Abstract
Chain-of-Thought (CoT) reasoning significantly elevates the complex problem-solving capabilities of multimodal large language models (MLLMs). However, adapting CoT to vision typically discretizes signals to fit LLM inputs, causing early semantic collapse and discarding fine-grained details. While external tools can mitigate this, they introduce a rigid bottleneck, confining reasoning to predefined operations. Although recent latent reasoning paradigms internalize visual states to overcome these limitations, optimizing the resulting hybrid discrete-continuous action space remains challenging. In this work, we propose HyLaR (Hybrid Latent Reasoning), a framework that seamlessly interleaves discrete text generation with continuous visual latent representations. Specifically, following an initial cold-start supervised fine-tuning (SFT), we introduce DePO (Decoupled Policy Optimization) to enable effective reinforcement learning within this hybrid space. DePO decomposes the policy gradient objective, applying independent trust-region constraints to the textual and latent components, alongside an exact closed-form von Mises-Fisher (vMF) KL regularizer. Extensive experiments demonstrate that HyLaR outperforms standard MLLMs and state-of-the-art latent reasoning approaches across fine-grained perception and general multimodal understanding benchmarks. Code is available at https://github.com/EthenCheng/HyLaR.
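The vMF KL regularizer mentioned above has a well-known closed form, which the sketch below illustrates using only the Python standard library. It is the textbook expression for the KL divergence between two von Mises-Fisher distributions on the unit sphere, not the authors' implementation: the abstract does not state the latent dimension, whether the concentration is fixed or learned, or the direction of the KL term, so the function names, the argument order, and the example values are all assumptions made for illustration.

```python
import math


def log_bessel_i(v, x, terms=200):
    """log of the modified Bessel function I_v(x), x > 0, via its power series.

    Terms are summed in log space. With 200 terms this is adequate for
    moderate x (roughly x <= 100); it is not meant for very large arguments.
    """
    logs = [
        (2 * k + v) * math.log(x / 2.0)
        - math.lgamma(k + 1)
        - math.lgamma(k + v + 1)
        for k in range(terms)
    ]
    m = max(logs)
    return m + math.log(sum(math.exp(t - m) for t in logs))


def log_norm_const(d, kappa):
    """log C_d(kappa), where C_d = kappa^(d/2-1) / ((2*pi)^(d/2) * I_{d/2-1}(kappa))."""
    v = d / 2.0 - 1.0
    return (
        v * math.log(kappa)
        - (d / 2.0) * math.log(2.0 * math.pi)
        - log_bessel_i(v, kappa)
    )


def mean_resultant_length(d, kappa):
    """A_d(kappa) = I_{d/2}(kappa) / I_{d/2-1}(kappa), so that E[x] = A_d(kappa) * mu."""
    return math.exp(log_bessel_i(d / 2.0, kappa) - log_bessel_i(d / 2.0 - 1.0, kappa))


def vmf_kl(mu_p, kappa_p, mu_q, kappa_q):
    """KL( vMF(mu_p, kappa_p) || vMF(mu_q, kappa_q) ) for unit-norm mean directions.

    Closed form:
        log C_d(kappa_p) - log C_d(kappa_q)
        + A_d(kappa_p) * (kappa_p - kappa_q * <mu_p, mu_q>)
    """
    d = len(mu_p)
    cos = sum(a * b for a, b in zip(mu_p, mu_q))
    return (
        log_norm_const(d, kappa_p)
        - log_norm_const(d, kappa_q)
        + mean_resultant_length(d, kappa_p) * (kappa_p - kappa_q * cos)
    )


if __name__ == "__main__":
    # Hypothetical example: two 3-D directions with equal concentration.
    same = vmf_kl([1.0, 0.0, 0.0], 2.0, [1.0, 0.0, 0.0], 2.0)
    orthogonal = vmf_kl([1.0, 0.0, 0.0], 2.0, [0.0, 1.0, 0.0], 2.0)
    print(f"KL(identical) = {same:.6f}")
    print(f"KL(orthogonal) = {orthogonal:.6f}")
```

When both distributions share the same concentration, the normalizer terms cancel and the expression reduces to `kappa * A_d(kappa) * (1 - cos)`, a penalty that depends only on the angle between the two mean directions. This is one plausible reason an exact vMF KL is attractive as a regularizer for normalized latent states, though the abstract itself does not spell out the motivation.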