Multimodal Chain of Continuous Thought for Latent-Space Reasoning in Vision-Language Models

📅 2025-08-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing chain-of-thought (CoT) methods model reasoning as discrete textual sequences, limiting their ability to achieve dynamic cross-modal alignment among audio, visual, and textual modalities in multimodal settings. To address this, we propose Multimodal Continuous Chain-of-Thought (MCOUT), the first framework that migrates reasoning from natural language space into a shared latent space, representing thoughts as continuous latent vectors instead of discrete tokens and iteratively fusing visual and textual semantics. MCOUT constructs continuous thought vectors from the final-layer hidden states of a language model and introduces a novel multimodal latent-space attention mechanism that enables human-like reflective cross-modal alignment. Evaluated on the MMMU, ScienceQA, and MMStar benchmarks, MCOUT achieves up to an 8.23% absolute accuracy improvement and an 8.27% BLEU gain over strong baselines, demonstrating substantially stronger multimodal reasoning.
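The core loop of MCOUT-Base, as described above, reuses the model's final-layer hidden state as a continuous thought that is fed back for further reasoning instead of being decoded into a token. The following is a minimal sketch of that iterative latent-refinement idea; `step_fn`, the toy linear-plus-tanh update, and all dimensions are assumptions for illustration, not the paper's actual architecture:

```python
import numpy as np

def mcout_base_refine(hidden_state, step_fn, num_iterations=4):
    """Iteratively refine a continuous thought vector (hypothetical sketch).

    Instead of decoding a token, the last hidden state is fed back into the
    model as the next reasoning input. `step_fn` stands in for one forward
    pass of the language-model backbone.
    """
    thought = hidden_state
    for _ in range(num_iterations):
        thought = step_fn(thought)  # feed the latent thought back in
    return thought

# Toy stand-in for a model step: a fixed linear map plus tanh nonlinearity.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8)) * 0.1
step = lambda h: np.tanh(h @ W)

h0 = rng.standard_normal(8)            # initial "thought" from the backbone
final_thought = mcout_base_refine(h0, step)
print(final_thought.shape)
```

The key property captured here is that the reasoning state stays in the model's continuous hidden space across iterations, rather than being bottlenecked through discrete tokens at every step.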

📝 Abstract
Many reasoning techniques for large multimodal models adapt language model approaches, such as Chain-of-Thought (CoT) prompting, which express reasoning as word sequences. While effective for text, these methods are suboptimal for multimodal contexts, struggling to align audio, visual, and textual information dynamically. To explore an alternative paradigm, we propose the Multimodal Chain of Continuous Thought (MCOUT), which enables reasoning directly in a joint latent space rather than in natural language. In MCOUT, the reasoning state is represented as a continuous hidden vector, iteratively refined and aligned with visual and textual embeddings, inspired by human reflective cognition. We develop two variants: MCOUT-Base, which reuses the language model's last hidden state as the continuous thought for iterative reasoning, and MCOUT-Multi, which integrates multimodal latent attention to strengthen cross-modal alignment between visual and textual features. Experiments on benchmarks including MMMU, ScienceQA, and MMStar show that MCOUT consistently improves multimodal reasoning, yielding up to 8.23% accuracy gains over strong baselines and improving BLEU scores by up to 8.27% across multiple-choice and open-ended tasks. These findings highlight latent continuous reasoning as a promising direction for advancing LMMs beyond language-bound CoT, offering a scalable framework for human-like reflective multimodal inference. Code is available at https://github.com/Hanhpt23/OmniMod.
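The MCOUT-Multi variant adds multimodal latent attention, in which the continuous thought attends over visual and textual embeddings to stay aligned with both modalities. A minimal sketch of one such update, assuming standard scaled dot-product attention with a residual mix-back (the function name, residual form, and feature shapes are illustrative assumptions, not the paper's exact mechanism):

```python
import numpy as np

def latent_cross_attention(thought, visual_feats, text_feats):
    """Hypothetical multimodal latent attention step.

    The continuous thought vector acts as the query over concatenated
    visual and textual features; the attended summary is mixed back into
    the thought via a residual update.
    """
    context = np.concatenate([visual_feats, text_feats], axis=0)  # (N, d)
    d = thought.shape[-1]
    scores = context @ thought / np.sqrt(d)          # (N,) attention logits
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                         # softmax over tokens
    attended = weights @ context                     # (d,) attended summary
    return thought + attended                        # residual refinement

rng = np.random.default_rng(1)
thought = rng.standard_normal(16)
vis = rng.standard_normal((5, 16))   # e.g. 5 visual patch embeddings
txt = rng.standard_normal((7, 16))   # e.g. 7 text token embeddings
refined = latent_cross_attention(thought, vis, txt)
print(refined.shape)
```

Iterating this update alongside the latent reasoning loop is what, per the abstract, lets the thought vector be "iteratively refined and aligned with visual and textual embeddings."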
Problem

Research questions and friction points this paper is trying to address.

Improves multimodal reasoning in vision-language models
Aligns audio, visual, and textual information dynamically
Enables reasoning in joint latent space, not just language
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal reasoning in joint latent space
Continuous hidden vector for iterative refinement
Multimodal latent attention for cross-modal alignment