TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models

📅 2025-12-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Unified multimodal models (UMMs) that rely on separate encoders for understanding and generation end up with inconsistent visual representation spaces for images and videos. Method: This paper proposes a native unified continuous visual representation space, built by cascading a VAE encoder with a learnable representation encoder, which enables end-to-end joint training for image and video understanding and generation within a shared latent space. Contribution/Results: The unified design lets understanding and generation benefit from each other rather than interfere, and the authors find that stronger pretrained representation encoders consistently improve performance across multimodal tasks. The approach achieves state-of-the-art results on benchmarks for image and video understanding, image and video generation, and image editing, supporting the effectiveness and scalability of a unified representation space for multimodal learning.

📝 Abstract

Unified multimodal models (UMMs) aim to jointly perform multimodal understanding and generation within a single framework. We present TUNA, a native UMM that builds a unified continuous visual representation by cascading a VAE encoder with a representation encoder. This unified representation space allows end-to-end processing of images and videos for both understanding and generation tasks. Compared to prior UMMs with decoupled representations, TUNA's unified visual space avoids representation format mismatches introduced by separate encoders, outperforming decoupled alternatives in both understanding and generation. Moreover, we observe that stronger pretrained representation encoders consistently yield better performance across all multimodal tasks, highlighting the importance of the representation encoder. Finally, in this unified setting, jointly training on both understanding and generation data allows the two tasks to benefit from each other rather than interfere. Our extensive experiments on multimodal understanding and generation benchmarks show that TUNA achieves state-of-the-art results in image and video understanding, image and video generation, and image editing, demonstrating the effectiveness and scalability of its unified representation design.

Problem

Research questions and friction points this paper is trying to address.

Unified multimodal models for joint understanding and generation

Addressing representation mismatch in visual encoding

Enhancing multimodal tasks with unified visual representation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Cascades VAE encoder with representation encoder for unified visual space

Unified representation enables end-to-end multimodal understanding and generation

Joint training on understanding and generation data enhances both tasks
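
The cascade described above can be illustrated with a toy sketch. This is not TUNA's implementation: the class names, patch size, dimensions, and random linear projections below are all hypothetical stand-ins, chosen only to show how a VAE-style encoder feeding a representation encoder yields one continuous token sequence that both an understanding path and a generation path could consume.

```python
import numpy as np

rng = np.random.default_rng(0)


class ToyVAEEncoder:
    """Hypothetical stand-in for a pretrained VAE encoder.

    Splits an image into non-overlapping patches and linearly projects
    each patch to a continuous latent vector.
    """

    def __init__(self, patch=8, channels=3, latent_dim=16):
        self.patch = patch
        self.proj = rng.standard_normal((patch * patch * channels, latent_dim)) * 0.02

    def __call__(self, image):
        h, w, c = image.shape
        p = self.patch
        # (H, W, C) -> (H/p, p, W/p, p, C) -> (num_patches, p*p*C)
        patches = image.reshape(h // p, p, w // p, p, c)
        patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * c)
        return patches @ self.proj  # (num_patches, latent_dim)


class ToyRepresentationEncoder:
    """Hypothetical stand-in for the representation encoder.

    Maps VAE latents to the feature space shared by both tasks.
    """

    def __init__(self, latent_dim=16, hidden_dim=32):
        self.w = rng.standard_normal((latent_dim, hidden_dim)) * 0.1

    def __call__(self, latents):
        return np.tanh(latents @ self.w)  # (num_patches, hidden_dim)


def unified_visual_tokens(image, vae, rep):
    """Cascade: image -> VAE latents -> unified representation tokens."""
    return rep(vae(image))


vae = ToyVAEEncoder()
rep = ToyRepresentationEncoder()
image = rng.random((32, 32, 3))

# The same tokens would feed both the understanding and the generation path.
tokens = unified_visual_tokens(image, vae, rep)
print(tokens.shape)  # (16, 32)
```

The point of the sketch is structural: there is a single encoding path, so no second encoder exists whose output format could mismatch the first. How the actual model handles video, noise injection for generation, or the downstream decoder is not covered here.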
