AI Summary
This work addresses the challenges of inefficient joint optimization between video understanding and generation, poor cross-modal compatibility, and high training costs in multimodal unified modeling. To this end, we propose the first end-to-end multimodal model built upon a single Transformer architecture, capable of jointly understanding and generating images and videos. Our key contributions are: (1) a multimodal warm-up strategy to mitigate initialization bias arising from modality heterogeneity; (2) a feature pre-scaling mechanism to harmonize feature scales across visual, linguistic, and temporal modalities; and (3) multimodal adaptive layer normalization (AdaLN) for dynamic, cross-modal conditional modulation. Under constrained training budgets, our model surpasses existing unified multimodal models on multiple image and video understanding and generation benchmarks. The source code is publicly available.
Abstract
With the advancement of language models, unified multimodal understanding and generation have made significant strides, with model architectures evolving from separated components to unified single-model frameworks. This paper explores an efficient training paradigm for building a single transformer that unifies multimodal understanding and generation. Specifically, we propose a multimodal warm-up strategy that leverages prior knowledge to extend model capabilities. To address cross-modal compatibility challenges, we introduce feature pre-scaling and multimodal AdaLN techniques. Integrating these techniques, we present HaploOmni, a new single multimodal transformer. With limited training costs, HaploOmni achieves competitive performance across multiple image and video understanding and generation benchmarks compared with advanced unified models. All code will be made public at https://github.com/Tencent/HaploVLM.
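To make the two compatibility mechanisms concrete, the following is a minimal NumPy sketch of feature pre-scaling and conditional layer-norm modulation in the AdaLN style. It is an illustration under stated assumptions, not the released implementation: the function names (`pre_scale`, `multimodal_adaln`), the per-modality scalar scale, and the single linear map from a conditioning vector to `(gamma, beta)` are all hypothetical simplifications.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def pre_scale(x, scale):
    # Feature pre-scaling: rescale a modality's features toward a
    # common magnitude before they enter the shared transformer.
    return x * scale

def multimodal_adaln(x, cond, W, b):
    # AdaLN-style modulation: a conditioning vector (here hypothetically
    # one per modality) is mapped to per-channel (gamma, beta), which
    # modulate the normalized features.
    gamma_beta = cond @ W + b          # shape (2*d,)
    d = x.shape[-1]
    gamma, beta = gamma_beta[:d], gamma_beta[d:]
    return layer_norm(x) * (1.0 + gamma) + beta

rng = np.random.default_rng(0)
d = 8
# Toy features: visual tokens with a much larger scale than text tokens.
vision = rng.normal(scale=5.0, size=(4, d))
text = rng.normal(scale=1.0, size=(4, d))
# Bring the visual features to the text scale (scale factor assumed known).
vision = pre_scale(vision, 1.0 / 5.0)

cond = rng.normal(size=(d,))                 # hypothetical modality embedding
W = rng.normal(scale=0.02, size=(d, 2 * d))  # linear map to (gamma, beta)
b = np.zeros(2 * d)
out = multimodal_adaln(vision, cond, W, b)
```

In a real model the scale factors and the `(gamma, beta)` projection would be learned jointly with the transformer; the sketch only shows where each operation sits in the pipeline.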