UniFork: Exploring Modality Alignment for Unified Multimodal Understanding and Generation

📅 2025-06-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work identifies a fundamental conflict in unified multimodal models: understanding tasks benefit from progressively strengthening cross-modal alignment with network depth to build up semantics, whereas generation tasks need alignment in the shallow layers but disentanglement in the deep layers to preserve spatial fidelity. To resolve this, the authors propose UniFork, a Y-shaped architecture with shared shallow layers for generic cross-modal representation learning and task-specific deep branches that decouple the alignment dynamics of each task. Guided by an analysis of modality alignment behavior, extensive ablation studies show that UniFork consistently outperforms fully shared Transformer backbones and matches or exceeds single-task expert models on diverse understanding and generation benchmarks, all within a single unified architecture.

📝 Abstract
Unified image understanding and generation has emerged as a promising paradigm in multimodal artificial intelligence. Despite recent progress, the optimal architectural design for such unified models remains an open challenge. In this work, we start by analyzing the modality alignment behaviors of task-specific expert models for understanding and generation, as well as current unified models. Our analysis reveals a crucial observation: understanding tasks benefit from progressively increasing modality alignment across network depth, which helps build up semantic information for better comprehension. In contrast, generation tasks follow a different trend: modality alignment increases in the early layers but decreases in the deep layers to recover spatial details. These divergent alignment patterns create a fundamental conflict in fully shared Transformer backbones, where a uniform representational flow often leads to performance compromises across the two tasks. Motivated by this finding, we introduce UniFork, a novel Y-shaped architecture that shares the shallow layers for cross-task representation learning, while employing task-specific branches in the deeper layers to avoid task interference. This design effectively balances shared learning and task specialization. Through extensive ablation experiments, we demonstrate that UniFork consistently outperforms conventional fully shared Transformer architectures, and achieves performance on par with or better than task-specific models.
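The routing described in the abstract (shared shallow layers, then a fork into task-specific deep branches) can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the "layers" are placeholder callables rather than Transformer blocks, and names like `num_shared` and `num_branch` are assumptions for illustration.

```python
from typing import Callable, List

Layer = Callable[[list], list]

def make_layer(tag: str) -> Layer:
    # A stand-in "layer" that records which block processed the tokens,
    # so the two forward paths can be compared.
    def layer(tokens: list) -> list:
        return tokens + [tag]
    return layer

class YShapedBackbone:
    """Shared shallow trunk feeding two task-specific deep branches."""

    def __init__(self, num_shared: int = 2, num_branch: int = 2):
        self.shared = [make_layer(f"shared{i}") for i in range(num_shared)]
        self.und_branch = [make_layer(f"und{i}") for i in range(num_branch)]
        self.gen_branch = [make_layer(f"gen{i}") for i in range(num_branch)]

    def forward(self, tokens: list, task: str) -> list:
        for layer in self.shared:  # shared shallow layers, both tasks
            tokens = layer(tokens)
        # fork: deep layers are task-specific, avoiding interference
        branch = self.und_branch if task == "understanding" else self.gen_branch
        for layer in branch:
            tokens = layer(tokens)
        return tokens

# Both tasks traverse the same shallow stack before diverging:
model = YShapedBackbone()
und_path = model.forward(["x"], "understanding")
gen_path = model.forward(["x"], "generation")
```

The design point the sketch makes concrete: weights (here, layer objects) are shared exactly where the two tasks' alignment trends agree, and duplicated where they diverge.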
Problem

Research questions and friction points this paper is trying to address.

Understanding how modality alignment evolves across network depth in unified multimodal models
Resolving the conflicting alignment patterns that understanding and generation impose on fully shared Transformer backbones
Designing an architecture that balances shared representation learning with task specialization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Y-shaped architecture with task-specific deep branches
Shared shallow layers for cross-task representation learning
Layer-wise modality alignment analysis of expert and unified models