LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model

📅 2026-04-02
📈 Citations: 0
Influential Citations: 0
🤖 AI Summary
This work addresses a limitation of existing unified vision-language models: they rely on pixel-space decoding to bridge visual understanding and generation, which makes cross-modal reasoning inefficient and degrades performance. To overcome this, the authors propose the first unified framework that eliminates pixel-space intermediaries by constructing a shared semantic latent space, enabling seamless integration of comprehension and generation. The approach removes encoder-decoder (codec) biases, supports flexible interleaved cross-modal reasoning, and introduces self-reflective generation and dynamic world modeling directly in the latent space. Evaluated on the Visual Spatial Planning benchmark, the method achieves state-of-the-art performance, significantly improving both generative quality and accuracy in predicting future visual states.
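
To make the latent-space idea concrete, here is a minimal PyTorch sketch of interleaved reasoning that stays in one shared semantic space, with generated visual latents reusable as context instead of being decoded to pixels. All names (LatentUMSketch, visual_proj, latent_head) and dimensions are illustrative assumptions, not the paper's released architecture or API.

```python
# Hypothetical sketch: interleaved cross-modal reasoning without a pixel-space bridge.
# Module names and sizes are illustrative, not the paper's actual implementation.
import torch
import torch.nn as nn

class LatentUMSketch(nn.Module):
    """Toy unified model: text and image tokens share one semantic latent space."""

    def __init__(self, d_model=1024, vocab_size=32000):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # Projects image features into the same d_model-dim space as text tokens.
        self.visual_proj = nn.Linear(768, d_model)
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=4,
        )
        # Generation head emits latent visual tokens, not pixels; a pixel decoder is
        # only needed when an image must finally be rendered for a human.
        self.latent_head = nn.Linear(d_model, d_model)

    def forward(self, text_ids, image_feats):
        # Interleave text and visual tokens in one sequence within the shared space.
        tokens = torch.cat([self.text_embed(text_ids),
                            self.visual_proj(image_feats)], dim=1)
        hidden = self.backbone(tokens)
        # Emit one latent visual token; it lives in the same space as the inputs,
        # so it can be concatenated back into the context for the next step.
        return self.latent_head(hidden[:, -1:, :])


# Usage (shapes only): one reasoning step over interleaved text + image tokens.
model = LatentUMSketch()
text_ids = torch.randint(0, 32000, (1, 16))        # 16 text tokens
image_feats = torch.randn(1, 64, 768)               # 64 patch features from a frozen encoder
next_visual_latent = model(text_ids, image_feats)   # stays in the shared latent space
```

The design point this sketch illustrates is that the generation head outputs latents in the same space the backbone reads, so a generated "image" can immediately serve as input to the next reasoning step without an encode-to-pixels / re-encode round trip.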
📝 Abstract
Unified models (UMs) hold promise for their ability to understand and generate content across heterogeneous modalities. Compared to merely generating visual content, the use of UMs for interleaved cross-modal reasoning is more promising and valuable, e.g., for solving understanding problems that require dense visual thinking, improving visual generation through self-reflection, or modeling visual dynamics of the physical world guided by stepwise action interventions. However, existing UMs necessitate pixel decoding as a bridge due to their disjoint visual representations for understanding and generation, which is both ineffective and inefficient. In this paper, we introduce LatentUM, a novel unified model that represents all modalities within a shared semantic latent space, eliminating the need for pixel-space mediation between visual understanding and generation. This design naturally enables flexible interleaved cross-modal reasoning and generation. Beyond improved computational efficiency, the shared representation substantially alleviates codec bias and strengthens cross-modal alignment, allowing LatentUM to achieve state-of-the-art performance on the Visual Spatial Planning benchmark, push the limits of visual generation through self-reflection, and support world modeling by predicting future visual states within the shared semantic latent space.
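
The abstract's world-modeling use case (predicting future visual states under stepwise action interventions) can likewise be sketched as a rollout that never leaves the latent space. The module below is a toy stand-in, assuming a simple GRU transition over latent visual states conditioned on action embeddings; it is not the paper's actual dynamics model.

```python
# Hypothetical sketch of latent-space world modeling: roll out future visual states
# from stepwise actions without decoding to pixels in between.
import torch
import torch.nn as nn

class LatentWorldModelSketch(nn.Module):
    """Toy dynamics head: next visual latent = f(current latent, action embedding)."""

    def __init__(self, d_latent=1024, d_action=128):
        super().__init__()
        self.step = nn.GRUCell(d_action, d_latent)  # action drives the latent transition

    def forward(self, state, action):
        # One intervention step: condition on the action, update the latent visual state.
        return self.step(action, state)


world = LatentWorldModelSketch()
state = torch.randn(1, 1024)                        # current visual state, already in latent space
actions = [torch.randn(1, 128) for _ in range(3)]   # three stepwise action embeddings

trajectory = [state]
for a in actions:
    state = world(state, a)    # predict the next visual state without rendering pixels
    trajectory.append(state)
# Any element of `trajectory` could be decoded to pixels later, purely for visualization.
```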
Problem

Research questions and friction points this paper is trying to address.

unified models
interleaved cross-modal reasoning
visual understanding and generation
latent space
codec bias
Innovation

Methods, ideas, or system contributions that make the work stand out.

latent-space unified model
interleaved cross-modal reasoning
shared semantic representation
visual understanding and generation
codec bias alleviation