🤖 AI Summary
This work addresses a limitation of existing unified vision-language models: they rely on pixel-space decoding to bridge visual understanding and generation, which makes cross-modal reasoning inefficient and degrades performance. To overcome this, the authors propose LatentUM, a unified framework that eliminates pixel-space intermediaries by representing all modalities in a shared semantic latent space, enabling seamless integration of comprehension and generation. The approach alleviates codec bias, supports flexible interleaved cross-modal reasoning, and enables self-reflective generation and dynamic world modeling directly in the latent space. On the Visual Spatial Planning benchmark, the method achieves state-of-the-art performance, improving both generative quality and the accuracy of predicted future visual states.
📝 Abstract
Unified models (UMs) hold promise for understanding and generating content across heterogeneous modalities. Beyond merely generating visual content, UMs are especially valuable for interleaved cross-modal reasoning, e.g., solving understanding problems that require dense visual thinking, improving visual generation through self-reflection, or modeling the visual dynamics of the physical world under stepwise action interventions. However, because existing UMs use disjoint visual representations for understanding and generation, they must decode to pixels as a bridge between the two, which is both ineffective and inefficient. In this paper, we introduce LatentUM, a novel unified model that represents all modalities within a shared semantic latent space, eliminating the need for pixel-space mediation between visual understanding and generation. This design naturally enables flexible interleaved cross-modal reasoning and generation. Beyond improved computational efficiency, the shared representation substantially alleviates codec bias and strengthens cross-modal alignment, allowing LatentUM to achieve state-of-the-art performance on the Visual Spatial Planning benchmark, push the limits of visual generation through self-reflection, and support world modeling by predicting future visual states within the shared semantic latent space.