AI Summary
Existing diffusion models excel at text-to-image generation but struggle to generalize to multimodal understanding, editing, and perception tasks, typically relying on separate vision-language models or modular architectures, which leads to semantic fragmentation and computational inefficiency. To address this, we propose UniAlignment, a unified multimodal framework built on a single diffusion transformer. It employs a dual-stream diffusion training strategy that jointly optimizes intrinsic-modal semantic alignment (e.g., image denoising) and cross-modal semantic alignment (e.g., image-text matching), enabling integration of generation, understanding, editing, and perception within one model. Key components include a cross-modal semantic alignment loss and an end-to-end multi-task learning objective. We further introduce SemGen-Bench, a dedicated benchmark for evaluating multimodal semantic consistency under complex textual instructions. Experiments demonstrate consistent and significant improvements over state-of-the-art methods across diverse multimodal tasks, supporting the feasibility of diffusion-based unified multimodal generation.
Abstract
The remarkable success of diffusion models in text-to-image generation has sparked growing interest in expanding their capabilities to a variety of multimodal tasks, including image understanding, manipulation, and perception. These tasks require advanced semantic comprehension across both visual and textual modalities, especially in scenarios involving complex semantic instructions. However, existing approaches often rely heavily on vision-language models (VLMs) or modular designs for semantic guidance, leading to fragmented architectures and computational inefficiency. To address these challenges, we propose UniAlignment, a unified multimodal generation framework within a single diffusion transformer. UniAlignment introduces a dual-stream diffusion training strategy that incorporates both intrinsic-modal semantic alignment and cross-modal semantic alignment, thereby enhancing the model's cross-modal consistency and instruction-following robustness. Additionally, we present SemGen-Bench, a new benchmark specifically designed to evaluate multimodal semantic consistency under complex textual instructions. Extensive experiments across multiple tasks and benchmarks demonstrate that UniAlignment outperforms existing baselines, underscoring the significant potential of diffusion models in unified multimodal generation.