🤖 AI Summary
Existing unified multimodal models (UMMs) suffer from inefficient pretraining paradigms and a scarcity of high-quality image–text paired data for their visual generation components. To address these issues, this work proposes IOMM, the first framework to enable purely image-based pretraining for UMM visual generators. It comprises two stages: an initial masked-modeling pretraining stage that uses only unlabeled images, followed by efficient fine-tuning that blends a small amount of image–text pairs with unlabeled images to substantially improve instruction alignment and generation quality. The IOMM-B variant (3.6B parameters), trained in roughly 1,050 H800 GPU hours, achieves state-of-the-art performance on GenEval (0.89) and WISE (0.55), outperforming strong baselines such as BAGEL-7B and BLIP3-o-4B, demonstrating advantages in both training efficiency and generative capability.
📝 Abstract
Unified Multimodal Models (UMMs) are often constrained by the pre-training of their $\textbf{visual generation components}$, which typically relies on inefficient paradigms and scarce, high-quality text-image paired data. In this paper, we systematically analyze pre-training recipes for $\textbf{UMM visual generation}$ and identify these two issues as the major bottlenecks.
To address them, we propose $\textbf{Image-Only Training for UMMs (IOMM)}$, a data-efficient two-stage training framework.
The first stage pre-trains the visual generative component $\textbf{exclusively}$ using abundant unlabeled image-only data, thereby removing the dependency on paired data $\textbf{for this costly phase}$. The second stage fine-tunes the model using a mixture of unlabeled images and a small curated set of text-image pairs, leading to improved instruction alignment and generative quality.
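The two-stage data flow described above can be illustrated with a minimal sketch. This is not the released implementation: the masking ratio, the paired-data fraction, and all names (`mask_patches`, `stage2_batch`, `pair_fraction`) are hypothetical assumptions made for illustration only.

```python
import random

def mask_patches(patches, mask_ratio=0.75, rng=random):
    """Stage 1 (image-only): randomly hide a fraction of image patches.

    The generative component would be trained to reconstruct the patches
    at `masked_idx` from the visible ones, so no text labels are needed.
    The 0.75 ratio is an assumed placeholder, not the paper's setting.
    """
    n = len(patches)
    masked_idx = set(rng.sample(range(n), int(n * mask_ratio)))
    visible = [p for i, p in enumerate(patches) if i not in masked_idx]
    return visible, sorted(masked_idx)

def stage2_batch(unlabeled, paired, pair_fraction=0.1, batch_size=8, rng=random):
    """Stage 2 (fine-tuning): mix a small share of curated text-image
    pairs into a stream of unlabeled images (ratio is hypothetical).

    Unlabeled images are given a `None` caption so both kinds of sample
    share one batch format.
    """
    n_pairs = max(1, int(batch_size * pair_fraction))
    batch = rng.sample(paired, n_pairs)
    batch += [(img, None) for img in rng.sample(unlabeled, batch_size - n_pairs)]
    rng.shuffle(batch)
    return batch
```

For example, `stage2_batch(unlabeled, paired, pair_fraction=0.1, batch_size=8)` yields a batch of 8 samples of which 1 carries a caption, matching the idea that only a small curated paired set steers instruction alignment while abundant unlabeled images dominate training.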
Extensive experiments show that IOMM not only improves training efficiency but also achieves state-of-the-art (SOTA) performance.
For example, our IOMM-B (3.6B) model was trained from scratch using only $\sim \textbf{1050}$ H800 GPU hours, with the vast majority ($\textbf{1000}$ hours) dedicated to the efficient $\textbf{image-only pre-training stage}$. It achieves $\textbf{0.89}$ on GenEval and $\textbf{0.55}$ on WISE, surpassing strong baselines such as BAGEL-7B (0.82 & 0.55) and BLIP3-o-4B (0.84 & 0.50).
Code is available at $\href{https://github.com/LINs-lab/IOMM}{https://github.com/LINs-lab/IOMM}$.