AI Summary
This work addresses the challenges of inefficient joint optimization between video understanding and generation, poor cross-modal compatibility, and high training costs in multimodal unified modeling. To this end, we propose the first end-to-end multimodal model built upon a single Transformer architecture, capable of jointly understanding and generating images and videos. Our key contributions are: (1) a multimodal warm-up strategy to mitigate initialization bias arising from modality heterogeneity; (2) a feature pre-scaling mechanism to harmonize feature scales across visual, linguistic, and temporal modalities; and (3) multimodal adaptive layer normalization (AdaLN) for dynamic, cross-modal conditional modulation. Under constrained training budgets, our model surpasses existing unified multimodal models on multiple image and video understanding and generation benchmarks. The source code is publicly available.
Abstract
With the advancement of language models, unified multimodal understanding and generation have made significant strides, with model architectures evolving from separated components to unified single-model frameworks. This paper explores an efficient training paradigm for building a single transformer that unifies multimodal understanding and generation. Specifically, we propose a multimodal warm-up strategy that leverages prior knowledge to extend model capabilities. To address cross-modal compatibility challenges, we introduce feature pre-scaling and multimodal AdaLN techniques. Integrating these techniques, we present HaploOmni, a new single multimodal transformer. With limited training costs, HaploOmni achieves competitive performance across multiple image and video understanding and generation benchmarks compared with advanced unified models. All code will be made public at https://github.com/Tencent/HaploVLM.
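To make the two compatibility mechanisms concrete, the following is a minimal NumPy sketch of feature pre-scaling and conditional layer-norm modulation in the AdaLN style. It is an illustration under stated assumptions, not the released implementation: the function names (`pre_scale`, `multimodal_adaln`), the per-modality scalar scale, and the single linear map from a conditioning vector to `(gamma, beta)` are all hypothetical simplifications.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def pre_scale(x, scale):
    # Feature pre-scaling: rescale a modality's features toward a
    # common magnitude before they enter the shared transformer.
    return x * scale

def multimodal_adaln(x, cond, W, b):
    # AdaLN-style modulation: a conditioning vector (here hypothetically
    # one per modality) is mapped to per-channel (gamma, beta), which
    # modulate the normalized features.
    gamma_beta = cond @ W + b          # shape (2*d,)
    d = x.shape[-1]
    gamma, beta = gamma_beta[:d], gamma_beta[d:]
    return layer_norm(x) * (1.0 + gamma) + beta

rng = np.random.default_rng(0)
d = 8
# Toy features: visual tokens with a much larger scale than text tokens.
vision = rng.normal(scale=5.0, size=(4, d))
text = rng.normal(scale=1.0, size=(4, d))
# Bring the visual features to the text scale (scale factor assumed known).
vision = pre_scale(vision, 1.0 / 5.0)

cond = rng.normal(size=(d,))                 # hypothetical modality embedding
W = rng.normal(scale=0.02, size=(d, 2 * d))  # linear map to (gamma, beta)
b = np.zeros(2 * d)
out = multimodal_adaln(vision, cond, W, b)
```

In a real model the scale factors and the `(gamma, beta)` projection would be learned jointly with the transformer; the sketch only shows where each operation sits in the pipeline.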