Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space

📅 2025-12-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multimodal large language models (MLLMs) rely on explicit, multi-step reasoning, which rigidly couples perception to reasoning, destabilizes cross-modal interaction, and incurs high computational overhead. This work proposes an implicit-space dynamic vision–language interleaved reasoning framework that removes explicit reasoning steps and enables human-like, nonlinear, synergistic multimodal processing. The approach introduces three core innovations: (1) confidence-guided policy gradient optimization that directly steers reasoning trajectories within the latent space; (2) dynamic visual patch retrieval and injection, which selectively activates salient perceptual signals on demand; and (3) implicit "think token" updating with adaptive multimodal feature fusion. The method is architecture-agnostic and compatible with mainstream MLLM backbones. Evaluated on seven standard multimodal reasoning benchmarks, it achieves significant accuracy improvements while reducing computational cost, demonstrating superior efficiency and robustness.
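
To make the first innovation concrete, here is a minimal PyTorch-style sketch of what confidence-guided policy gradient optimization over latent think tokens could look like at test time. Everything in it is an assumption for illustration: `forward_fn`, the Gaussian latent policy, the step count, and the max-softmax confidence reward are hypothetical stand-ins, not the paper's published formulation.

```python
import torch

def refine_latent_think_tokens(forward_fn, z_init, steps=8, lr=1e-2, noise_std=0.05):
    """REINFORCE-style, test-time refinement of latent think tokens (illustrative).

    forward_fn: hypothetical callable mapping latent tokens [T, d] to answer logits.
    z_init:     initial latent think tokens, shape [T, d].
    """
    mu = z_init.clone().requires_grad_(True)       # mean of the latent "policy"
    opt = torch.optim.Adam([mu], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        z = mu + noise_std * torch.randn_like(mu)  # sample a latent trajectory
        with torch.no_grad():
            conf = forward_fn(z).softmax(-1).max() # answer confidence as the reward
        # Gaussian log-prob of the sample w.r.t. mu; REINFORCE gradient estimator
        log_prob = -((z.detach() - mu) ** 2).sum() / (2 * noise_std ** 2)
        (-conf * log_prob).backward()              # ascend reward-weighted log-prob
        opt.step()
    return mu.detach()
```

The property this sketch shares with the described method is that no model weights change: only the latent think tokens are optimized, with the model's own confidence acting as the reward signal.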

📝 Abstract
Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced cross-modal understanding and reasoning by incorporating Chain-of-Thought (CoT) reasoning in the semantic space. Building upon this, recent studies extend the CoT mechanism to the visual modality, enabling models to integrate visual information during reasoning through external tools or explicit image generation. However, these methods remain dependent on explicit step-by-step reasoning and suffer from unstable perception–reasoning interaction and notable computational overhead. Inspired by human cognition, we posit that thinking unfolds not linearly but through the dynamic interleaving of reasoning and perception within the mind. Motivated by this perspective, we propose DMLR, a test-time Dynamic Multimodal Latent Reasoning framework that employs confidence-guided latent policy gradient optimization to refine latent think tokens for in-depth reasoning. Furthermore, a Dynamic Visual Injection Strategy is introduced, which retrieves the most relevant visual features for each latent think token and updates the set of best visual patches. The updated patches are then injected into the latent think tokens to achieve dynamic visual–textual interleaving. Experiments across seven multimodal reasoning benchmarks and various model architectures demonstrate that DMLR significantly improves reasoning and perception performance while maintaining high inference efficiency.
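
As a rough illustration of the Dynamic Visual Injection Strategy described above, the sketch below retrieves, for each latent think token, the top-k visual patch features by cosine similarity and injects their similarity-weighted average back into the token. The cosine retrieval, top-k size, softmax weighting, and residual fusion gate `alpha` are assumptions for illustration; the paper's exact retrieval and update rules may differ.

```python
import torch
import torch.nn.functional as F

def dynamic_visual_injection(think_tokens, patch_feats, k=4, alpha=0.5):
    """Retrieve the k most relevant visual patches per latent think token
    and inject them via a similarity-weighted residual update (illustrative).

    think_tokens: [T, d] latent think tokens
    patch_feats:  [P, d] visual patch features from the vision encoder
    """
    sim = F.normalize(think_tokens, dim=-1) @ F.normalize(patch_feats, dim=-1).T  # [T, P]
    top_sim, top_idx = sim.topk(k, dim=-1)              # best-k patches per token
    weights = top_sim.softmax(dim=-1).unsqueeze(-1)     # [T, k, 1]
    injected = (weights * patch_feats[top_idx]).sum(1)  # [T, d] fused visual signal
    return think_tokens + alpha * injected              # residual visual injection
```

Running this once per refinement step would interleave perception with latent reasoning: as the think tokens change, the retrieved patch set changes with them, matching the abstract's notion of updating the set of best visual patches.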
Problem

Research questions and friction points this paper is trying to address.

Enables dynamic interleaving of reasoning and perception in latent space
Reduces reliance on explicit step-by-step reasoning and computational overhead
Improves multimodal reasoning performance while maintaining high inference efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic Multimodal Latent Reasoning framework
Confidence-guided latent policy gradient optimization
Dynamic Visual Injection Strategy for on-demand visual feature retrieval and injection
Chengzhi Liu
PhD, UC Santa Barbara
Vision Language Model · Trustworthy AI · Reasoning
Yuzhe Yang
University of California, Santa Barbara
Yue Fan
University of California, Santa Cruz
Qingyu Wei
Stanford University
Sheng Liu
Stanford University
Xin Eric Wang
Assistant Professor, University of California, Santa Barbara, Simular
NLP · CV · ML · Language and Vision · AI Agents