AI Summary
Existing diffusion models excel at text-to-image generation but struggle to generalize to multimodal understanding, editing, and perception tasks, typically relying on separate vision-language models or modular architectures, which leads to semantic fragmentation and computational inefficiency. To address this, we propose UniAlignment, a unified multimodal framework built on a single diffusion transformer. It employs a dual-stream diffusion training strategy that jointly optimizes intrinsic-modal semantic alignment (e.g., image denoising) and cross-modal semantic alignment (e.g., image-text matching), enabling integration of generation, understanding, editing, and perception within one model. Key components include a cross-modal semantic alignment loss and an end-to-end multi-task learning objective. We further introduce SemGen-Bench, a dedicated benchmark for evaluating multimodal semantic consistency under complex textual instructions. Experiments demonstrate consistent and significant improvements over state-of-the-art methods across diverse multimodal tasks, supporting the feasibility of diffusion-based unified multimodal generation.
Abstract
The remarkable success of diffusion models in text-to-image generation has sparked growing interest in expanding their capabilities to a variety of multimodal tasks, including image understanding, manipulation, and perception. These tasks require advanced semantic comprehension across both visual and textual modalities, especially in scenarios involving complex semantic instructions. However, existing approaches often rely heavily on vision-language models (VLMs) or modular designs for semantic guidance, leading to fragmented architectures and computational inefficiency. To address these challenges, we propose UniAlignment, a unified multimodal generation framework within a single diffusion transformer. UniAlignment introduces a dual-stream diffusion training strategy that incorporates both intrinsic-modal semantic alignment and cross-modal semantic alignment, thereby enhancing the model's cross-modal consistency and instruction-following robustness. Additionally, we present SemGen-Bench, a new benchmark specifically designed to evaluate multimodal semantic consistency under complex textual instructions. Extensive experiments across multiple tasks and benchmarks demonstrate that UniAlignment outperforms existing baselines, underscoring the significant potential of diffusion models in unified multimodal generation.