UniAlignment: Semantic Alignment for Unified Image Generation, Understanding, Manipulation and Perception

๐Ÿ“… 2025-09-28
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Existing diffusion models excel at text-to-image generation but struggle to generalize to multimodal understanding, editing, and perception tasks, typically relying on separate vision-language models or modular architectures, which leads to semantic fragmentation and computational inefficiency. To address this, the paper proposes UniAlignment, a unified multimodal framework built on a single diffusion transformer. It employs a dual-stream diffusion training strategy that jointly optimizes intrinsic-modal semantic alignment and cross-modal semantic alignment, enabling end-to-end integration of generation, understanding, editing, and perception while improving cross-modal consistency and instruction-following robustness. The paper also introduces SemGen-Bench, a dedicated benchmark for evaluating multimodal semantic consistency under complex textual instructions. Experiments across diverse multimodal tasks and benchmarks show consistent improvements over existing baselines, supporting the feasibility of diffusion-based unified multimodal intelligence.

๐Ÿ“ Abstract
The remarkable success of diffusion models in text-to-image generation has sparked growing interest in expanding their capabilities to a variety of multi-modal tasks, including image understanding, manipulation, and perception. These tasks require advanced semantic comprehension across both visual and textual modalities, especially in scenarios involving complex semantic instructions. However, existing approaches often rely heavily on vision-language models (VLMs) or modular designs for semantic guidance, leading to fragmented architectures and computational inefficiency. To address these challenges, we propose UniAlignment, a unified multimodal generation framework within a single diffusion transformer. UniAlignment introduces a dual-stream diffusion training strategy that incorporates both intrinsic-modal semantic alignment and cross-modal semantic alignment, thereby enhancing the model's cross-modal consistency and instruction-following robustness. Additionally, we present SemGen-Bench, a new benchmark specifically designed to evaluate multimodal semantic consistency under complex textual instructions. Extensive experiments across multiple tasks and benchmarks demonstrate that UniAlignment outperforms existing baselines, underscoring the significant potential of diffusion models in unified multimodal generation.
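The abstract describes a dual-stream training strategy that jointly optimizes intrinsic-modal and cross-modal semantic alignment, but the paper's exact loss formulation is not given here. As a rough illustration only, one plausible instantiation combines a standard diffusion denoising loss with a symmetric contrastive image-text alignment term; all function and parameter names below are hypothetical, not from the paper.

```python
import torch
import torch.nn.functional as F

def dual_stream_loss(pred_noise, true_noise, img_emb, txt_emb,
                     temperature=0.07, alpha=1.0):
    """Hypothetical sketch of a dual-stream objective: an
    intrinsic-modal denoising term plus a cross-modal contrastive
    alignment term. Not the paper's actual formulation."""
    # Intrinsic-modal stream: plain MSE between predicted and true noise.
    denoise_loss = F.mse_loss(pred_noise, true_noise)

    # Cross-modal stream: symmetric InfoNCE over matched
    # image/text embedding pairs (diagonal entries are positives).
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    align_loss = (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets)) / 2

    # Weighted sum of the two streams.
    return denoise_loss + alpha * align_loss
```

In practice such a combined objective would be backpropagated through a single diffusion transformer so that both streams shape the same shared representation, which is the motivation the abstract gives for avoiding a separate VLM.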
Problem

Research questions and friction points this paper is trying to address.

Achieving unified multimodal generation across image tasks
Enhancing cross-modal semantic alignment in diffusion models
Addressing fragmented architectures in vision-language model integration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified multimodal generation framework using diffusion transformer
Dual-stream diffusion training for semantic alignment
SemGen-Bench benchmark for multimodal consistency evaluation
Xinyang Song
School of Artificial Intelligence, University of Chinese Academy of Sciences
Libin Wang
AntGroup
Weining Wang
Institute of Automation, Chinese Academy of Sciences
Shaozhen Liu
Institute of Automation, Chinese Academy of Sciences
Dandan Zheng
AntGroup
Jingdong Chen
AntGroup
Qi Li
School of Artificial Intelligence, University of Chinese Academy of Sciences
Zhenan Sun
Institute of Automation, Chinese Academy of Sciences
Biometrics · Pattern Recognition · Computer Vision