🤖 AI Summary
This work addresses two limitations in multimodal implicit (latent) reasoning: visual information is attenuated by language bias, and fixed-depth networks lack the representational capacity to refine semantically complex tokens. An analysis of token-level gradient dynamics shows that visual tokens have higher and more volatile gradient norms than text tokens, and that complex tokens remain unstable under a fixed architectural depth. To mitigate both issues, the study proposes a dynamic reasoning architecture with two components. A visual replay module uses causal self-attention to estimate token saliency and reinforces fine-grained visual grounding under spatial-coherence constraints. An adaptive depth routing mechanism allocates additional reasoning steps to semantically complex tokens. Curriculum learning progressively internalizes explicit chain-of-thought reasoning into compact latent representations. The resulting model is reported to achieve state-of-the-art performance across multiple benchmarks and to run substantially faster at inference than explicit chain-of-thought baselines.
📝 Abstract
Multimodal latent reasoning has emerged as a promising paradigm that replaces explicit Chain-of-Thought (CoT) decoding with implicit feature propagation, enhancing representation informativeness while reducing inference latency. By analyzing token-level gradient dynamics during latent training, we reveal two critical observations: (1) visual tokens exhibit significantly higher and more volatile gradient norms than their textual counterparts due to inherent language bias, resulting in systematic visual under-optimization; and (2) semantically simple tokens converge rapidly, whereas complex tokens exhibit persistent gradient instability because they are constrained by a fixed architectural depth. To address these limitations, we propose a visual replay module and routing depth scaling, which jointly enhance visual perception and refine complex latents for deeper contextual reasoning. The visual replay module leverages causal self-attention to estimate token saliency, reinforcing fine-grained grounding through spatially coherent constraints. Complementarily, routing depth scaling adaptively allocates additional reasoning steps to complex tokens, enabling deeper contextual refinement. Guided by a curriculum strategy that progressively internalizes explicit CoT into compact latent representations, our framework achieves state-of-the-art performance across diverse benchmarks while delivering substantial inference speedups over explicit CoT baselines.
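The abstract does not spell out how saliency is computed or how extra depth is assigned, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that saliency can be approximated by the average causal attention a token receives, and that the most salient tokens are the ones routed to extra refinement steps. The function names and the `base_steps`, `extra_steps`, and `top_fraction` parameters are hypothetical.

```python
import math


def causal_attention(scores):
    """Row-wise softmax over a square score matrix with a causal mask.

    Token i may only attend to tokens j <= i; masked entries are 0.0.
    """
    n = len(scores)
    attn = []
    for i in range(n):
        visible = scores[i][: i + 1]
        peak = max(visible)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        attn.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return attn


def token_saliency(attn):
    """Assumed saliency proxy: mean attention token j receives from
    the tokens that are allowed to see it (positions j..n-1)."""
    n = len(attn)
    return [sum(attn[i][j] for i in range(j, n)) / (n - j) for j in range(n)]


def route_depth(saliency, base_steps=1, extra_steps=2, top_fraction=0.25):
    """Assign per-token step counts: every token gets base_steps, and the
    top_fraction most salient tokens get extra_steps on top (hypothetical rule)."""
    n = len(saliency)
    k = max(1, round(n * top_fraction))
    top = sorted(range(n), key=lambda j: saliency[j], reverse=True)[:k]
    steps = [base_steps] * n
    for j in top:
        steps[j] += extra_steps
    return steps


if __name__ == "__main__":
    # Toy scores for 4 tokens; token 2 is scored highly by later tokens.
    toy_scores = [
        [0.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 3.0, 0.0],
        [0.0, 0.0, 3.0, 0.0],
    ]
    attn = causal_attention(toy_scores)
    print(route_depth(token_saliency(attn)))
```

In a real model the routing decision would typically be learned and applied to hidden states inside the network; the fixed top-fraction rule here only illustrates how a saliency signal can translate into uneven computational depth.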