HaploOmni: Unified Single Transformer for Multimodal Video Understanding and Generation

πŸ“… 2025-06-03
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the challenges of inefficient joint optimization between video understanding and generation, poor cross-modal compatibility, and high training costs in multimodal unified modeling. To this end, we propose the first end-to-end multimodal model built upon a single Transformer architecture, capable of jointly understanding and generating images and videos. Our key contributions are: (1) a multimodal warm-up strategy to mitigate initialization bias arising from modality heterogeneity; (2) a feature pre-scaling mechanism to harmonize feature scales across visual, linguistic, and temporal modalities; and (3) multimodal adaptive layer normalization (AdaLN) for dynamic, cross-modal conditional modulation. Under constrained training budgets, our model surpasses existing unified multimodal models on multiple image and video understanding and generation benchmarks. The source code is publicly available.
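The summary names three techniques but gives no implementation detail. Below is a minimal PyTorch sketch of what the multimodal AdaLN component could look like, assuming per-modality scale/shift heads driven by a shared conditioning vector; the class and parameter names are hypothetical and not taken from the released code.

```python
import torch
import torch.nn as nn


class MultimodalAdaLN(nn.Module):
    """Hypothetical sketch: tokens from each modality are modulated by
    scale/shift parameters predicted from a shared conditioning vector."""

    def __init__(self, hidden_dim: int, num_modalities: int = 3):
        super().__init__()
        # Affine-free LayerNorm; the affine part comes from the condition.
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        # One (scale, shift) projection head per modality.
        self.heads = nn.ModuleList(
            nn.Linear(hidden_dim, 2 * hidden_dim) for _ in range(num_modalities)
        )

    def forward(self, x: torch.Tensor, cond: torch.Tensor, modality: int) -> torch.Tensor:
        # x: (batch, tokens, hidden); cond: (batch, hidden)
        scale, shift = self.heads[modality](cond).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
```

Zero-initializing each head would make every layer start out as a plain LayerNorm, the usual AdaLN-Zero trick in DiT-style models; whether HaploOmni does this is not stated in the summary.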

πŸ“ Abstract
With the advancement of language models, unified multimodal understanding and generation have made significant strides, with model architectures evolving from separated components to unified single-model frameworks. This paper explores an efficient training paradigm to build a single transformer for unified multimodal understanding and generation. Specifically, we propose a multimodal warmup strategy utilizing prior knowledge to extend capabilities. To address cross-modal compatibility challenges, we introduce feature pre-scaling and multimodal AdaLN techniques. Integrating the proposed technologies, we present HaploOmni, a new single multimodal transformer. With limited training costs, HaploOmni achieves competitive performance across multiple image and video understanding and generation benchmarks compared with advanced unified models. All code will be made public at https://github.com/Tencent/HaploVLM.
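As a rough illustration of the feature pre-scaling idea, here is a hedged sketch under the assumption that it reduces to learnable per-modality gains applied before tokens enter the shared transformer; the paper's actual mechanism may be richer, and all names below are illustrative.

```python
import torch
import torch.nn as nn


class FeaturePreScaling(nn.Module):
    """Hypothetical sketch: learnable per-modality gains so visual,
    linguistic, and temporal features start at comparable magnitudes
    before entering the shared transformer."""

    TEXT, IMAGE, VIDEO = 0, 1, 2  # hypothetical modality indices

    def __init__(self, init_scales=(1.0, 1.0, 1.0)):
        super().__init__()
        # One learnable scalar gain per modality.
        self.scales = nn.Parameter(torch.tensor(init_scales))

    def forward(self, tokens: torch.Tensor, modality: int) -> torch.Tensor:
        # tokens: (batch, tokens, hidden)
        return tokens * self.scales[modality]


# Usage (hypothetical): scale video tokens before the backbone.
# video_tokens = prescale(video_tokens, FeaturePreScaling.VIDEO)
```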
Problem

Research questions and friction points this paper is trying to address.

Build a single transformer that unifies multimodal understanding and generation
Resolve cross-modal compatibility issues via feature pre-scaling and multimodal AdaLN
Achieve competitive performance under limited training budgets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified single-transformer architecture for multimodal understanding and generation
Multimodal warm-up strategy leveraging prior knowledge (see the sketch after this list)
Feature pre-scaling and multimodal AdaLN for cross-modal conditioning
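A hedged sketch of how the warm-up stage might be wired, assuming a two-stage recipe in which only newly added multimodal modules train on top of a frozen pretrained backbone before joint tuning; the module name prefixes below are hypothetical.

```python
import torch.nn as nn


def set_warmup_stage(model: nn.Module, stage: str) -> None:
    """Hypothetical two-stage recipe: during 'warmup', only newly added
    multimodal modules receive gradients and the pretrained backbone
    stays frozen; any later stage unfreezes everything."""
    new_modules = ("visual_embed", "pre_scaling", "adaln")  # assumed names
    for name, param in model.named_parameters():
        if stage == "warmup":
            # Gradients flow only through the new multimodal components.
            param.requires_grad = any(key in name for key in new_modules)
        else:
            # Joint fine-tuning: train the full model.
            param.requires_grad = True
```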
πŸ‘₯ Authors
Yicheng Xiao (Tsinghua University) · Artificial Intelligence, Multimodal Learning
Lin Song (ARC Lab, Tencent PCG)
Rui Yang (The University of Hong Kong)
Cheng Cheng (Xi’an Jiaotong University)
Dijkstra Xu
Zhaoyang Zhang (ARC Lab, Tencent PCG)
Yixiao Ge (ARC Lab, Tencent PCG)
Xiu Li (Bytedance Seed) · Computer Vision, Computer Graphics, 3D Vision
Ying Shan (Distinguished Scientist at Tencent, Director of ARC Lab & AI Lab CVC) · Deep Learning, Computer Vision, Machine Learning, Paid Search, Display Ads