UniModel: A Visual-Only Framework for Unified Multimodal Understanding and Generation

📅 2025-11-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of jointly modeling multimodal understanding and generation within a unified visual framework, bridging the representation gap between text and images while enabling bidirectional cross-modal translation. We propose the first fully vision-native unified diffusion model: both text and images are encoded as RGB pixel sequences, establishing a pixel-level bidirectional mapping. A rectified-flow-driven diffusion Transformer serves as the backbone, rendered text images act as visual conditioning, and lightweight task embeddings enable multi-task joint training. The approach achieves end-to-end visual unification at the model, task, and representation levels: it significantly improves semantic alignment in text-to-image and image-to-text generation, enhances visual understanding accuracy, and exhibits emergent fine-grained controllability and cycle-consistent cross-modal translation.

📝 Abstract
We present UniModel, a unified generative model that jointly supports visual understanding and visual generation within a single pixel-to-pixel diffusion framework. Our goal is to achieve unification along three axes: the model, the tasks, and the representations. At the representation level, we eliminate modality discrepancies by mapping both text and images into a shared visual space: textual prompts are rendered as painted text images on a clean canvas, and all inputs and outputs are treated purely as RGB pixels. This yields a fully vision-native formulation of multimodal learning. At the task level, a broad range of vision-language problems are cast as pixel-to-pixel transformations in this visual space. For understanding tasks, the model takes an RGB image and produces a painted text image that visually encodes the semantic prediction. For generation tasks, painted text images serve as visual conditions that guide realistic and semantically aligned image synthesis. Captioning and text-to-image generation thus become different directions of the same underlying visual translation process. At the model level, we instantiate a single Unified Diffusion Transformer trained with rectified flow in pixel space. A shared backbone jointly learns bidirectional mappings between natural images and painted text images, with lightweight task embeddings to specify the desired direction. Experiments on text-to-image synthesis and image-to-text understanding demonstrate strong cross-modal alignment and emergent controllability such as cycle-consistent image-caption-image loops. Our initial exploration suggests that unifying model, tasks, and representations in a single visual space is a promising paradigm for general-purpose multimodal intelligence.
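The paper releases no code, but the "painted text image" representation it describes can be illustrated with a minimal Pillow sketch: a prompt is drawn in black on a clean white canvas, so that text enters the model purely as RGB pixels. The function name, canvas size, font, and word-wrap logic below are assumptions for illustration, not the authors' implementation.

```python
from PIL import Image, ImageDraw, ImageFont

def render_text_image(prompt: str, size=(256, 256)) -> Image.Image:
    """Render a text prompt as a 'painted text image' on a white RGB canvas."""
    canvas = Image.new("RGB", size, color="white")
    draw = ImageDraw.Draw(canvas)
    font = ImageFont.load_default()
    # Naive word wrap so long prompts stay within the canvas width.
    lines, line = [], ""
    for word in prompt.split():
        candidate = (line + " " + word).strip()
        if draw.textlength(candidate, font=font) > size[0] - 16:
            lines.append(line)
            line = word
        else:
            line = candidate
    lines.append(line)
    # Paint each wrapped line; the result is ordinary RGB pixel data.
    for i, text_line in enumerate(lines):
        draw.text((8, 8 + 12 * i), text_line, fill="black", font=font)
    return canvas

img = render_text_image("a red bicycle leaning against a brick wall")
```

Under this formulation, captioning produces such an image as output, while text-to-image generation consumes it as a visual condition.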
Problem

Research questions and friction points this paper is trying to address.

Unifying visual understanding and generation within a single pixel-to-pixel framework
Eliminating modality discrepancies by mapping text and images into shared visual space
Casting diverse vision-language tasks as bidirectional pixel-to-pixel transformations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified pixel-to-pixel diffusion framework for multimodal tasks
Text rendered as painted images in shared visual space
Single Unified Diffusion Transformer with rectified flow training
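The rectified-flow training mentioned above can be sketched generically in NumPy: sample Gaussian noise x0, interpolate linearly toward a data sample x1, and regress the constant velocity x1 - x0. This is a standard rectified-flow construction, not the paper's code; in UniModel the velocity predictor would be the Unified Diffusion Transformer, additionally conditioned on a task embedding.

```python
import numpy as np

def rectified_flow_pair(x1: np.ndarray, t: float, rng: np.random.Generator):
    """Return the interpolated state x_t and the rectified-flow target velocity.

    x_t = (1 - t) * x0 + t * x1 follows a straight path from noise to data,
    and the regression target is the constant velocity x1 - x0.
    """
    x0 = rng.standard_normal(x1.shape)   # Gaussian noise endpoint
    xt = (1.0 - t) * x0 + t * x1         # straight-line interpolation
    v_target = x1 - x0                   # velocity the network regresses
    return xt, v_target

rng = np.random.default_rng(0)
x1 = rng.random((4, 4, 3))               # toy 4x4 RGB "image" in [0, 1]
xt, v = rectified_flow_pair(x1, t=0.5, rng=rng)
# Training would minimize MSE between the transformer's predicted velocity
# (given x_t, t, and a task embedding) and v_target.
```

At t = 1 the interpolant recovers the data sample exactly, which is what makes the straight-line (rectified) path convenient to integrate at sampling time.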
Chi Zhang
Institute of Artificial Intelligence (TeleAI), China Telecom
Jiepeng Wang
The University of Hong Kong
3D Vision · AIGC · Robotics
Youming Wang
Institute of Artificial Intelligence (TeleAI), China Telecom
Yuanzhi Liang
UTS
Xiaoyan Yang
Advanced Digital Sciences Center
database · deep learning · text mining
Zuoxin Li
Institute of Artificial Intelligence (TeleAI), China Telecom
Haibin Huang
Principal Research Scientist at TeleAI
Computer Graphics · Computer Vision · Geometric Modeling · 3D Deep Learning
Xuelong Li
Institute of Artificial Intelligence (TeleAI), China Telecom