Reasoning in the Dark: Interleaved Vision-Text Reasoning in Latent Space

📅 2025-10-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal reasoning methods rely on human-annotated explicit visual–textual reasoning steps, incurring high annotation costs and substantial inference latency. This paper proposes an implicit latent-space interleaved reasoning framework that eliminates the need for explicit intermediate annotations by jointly modeling textual states and selective image embeddings within a shared latent space, enabling end-to-end vision–language collaborative reasoning. Key contributions include: (1) the first implicit multimodal reasoning paradigm that requires no explicit visual–textual step-level annotations; and (2) a progressive multi-stage training strategy that effectively balances reasoning accuracy and computational efficiency. Evaluated on M3CoT and ScienceQA, our method achieves an average accuracy improvement of 5.45% while accelerating inference by over 5×, demonstrating significantly enhanced generalization capability and practical applicability.

📝 Abstract
Multimodal reasoning aims to enhance the capabilities of MLLMs by incorporating intermediate reasoning steps before reaching the final answer. It has evolved from text-only reasoning to the integration of visual information, enabling the thought process to be conveyed through both images and text. Despite its effectiveness, current multimodal reasoning methods depend on explicit reasoning steps that require labor-intensive vision-text annotations and inherently introduce significant inference latency. To address these issues, we introduce multimodal latent reasoning with the advantages of multimodal representation, reduced annotation, and inference efficiency. To facilitate it, we propose Interleaved Vision-Text Latent Reasoning (IVT-LR), which injects both visual and textual information into the reasoning process within the latent space. Specifically, IVT-LR represents each reasoning step by combining two implicit parts: latent text (the hidden states from the previous step) and latent vision (a set of selected image embeddings). We further introduce a progressive multi-stage training strategy to enable MLLMs to perform the above multimodal latent reasoning steps. Experiments on M3CoT and ScienceQA demonstrate that our IVT-LR method achieves an average performance increase of 5.45% in accuracy, while simultaneously achieving a speed increase of over 5 times compared to existing approaches. Code available at https://github.com/FYYDCC/IVT-LR.
Problem

Research questions and friction points this paper is trying to address.

Addresses multimodal reasoning's dependency on explicit vision-text annotations
Reduces significant inference latency in multimodal reasoning processes
Enables efficient interleaved vision-text reasoning within latent space
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses latent space for multimodal reasoning
Combines latent text and vision embeddings
Employs progressive multi-stage training strategy
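The interleaved latent step described above — combining latent text (the previous step's hidden states) with latent vision (a set of selected image embeddings) — can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the mean-pooled query, and the similarity-based selection rule are assumptions for exposition; the paper's actual selection criterion may differ.

```python
import numpy as np

def select_image_embeddings(image_embeds, query, k=2):
    """Pick the k image embeddings most relevant to the current reasoning
    state (hypothetical dot-product scoring; the paper's rule may differ)."""
    scores = image_embeds @ query                 # relevance of each patch
    top_k = np.argsort(scores)[-k:]               # indices of top-k patches
    return image_embeds[top_k]

def latent_reasoning_step(prev_hidden, image_embeds, k=2):
    """One interleaved latent step: latent text (hidden states from the
    previous step) concatenated with latent vision (selected image
    embeddings), forming the input to the next step."""
    query = prev_hidden.mean(axis=0)              # summarize the text state
    latent_vision = select_image_embeddings(image_embeds, query, k)
    return np.concatenate([prev_hidden, latent_vision], axis=0)
```

In this sketch the combined sequence simply grows by `k` vision tokens per step; in an actual MLLM the concatenated latents would be fed back through the transformer to produce the next step's hidden states.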
Chao Chen
The Hong Kong Polytechnic University
Zhixin Ma
Singapore Management University
Yongqi Li
The Hong Kong Polytechnic University
Yupeng Hu
Shandong University
Yinwei Wei
Shandong University | National University of Singapore
Multimedia Computing · Information Retrieval · Recommender System
Wenjie Li
The Hong Kong Polytechnic University
Liqiang Nie
Harbin Institute of Technology (Shenzhen)