UniFork: Exploring Modality Alignment for Unified Multimodal Understanding and Generation

📅 2025-06-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work identifies a fundamental conflict in unified multimodal models: understanding tasks benefit from progressively strengthening cross-modal alignment with network depth to build up semantics, whereas generation tasks need alignment in the shallow layers but disentanglement in the deep layers to preserve spatial fidelity. To resolve this, the authors propose UniFork, a Y-shaped architecture with shared shallow layers for generic cross-modal representation learning and task-specific deep branches that decouple the alignment dynamics of each task. Guided by an analysis of modality alignment behavior, extensive ablation studies show that UniFork consistently outperforms fully shared Transformer backbones and matches or exceeds single-task expert models on diverse understanding and generation benchmarks, all within a single unified architecture.

📝 Abstract
Unified image understanding and generation has emerged as a promising paradigm in multimodal artificial intelligence. Despite recent progress, the optimal architectural design for such unified models remains an open challenge. In this work, we start by analyzing the modality alignment behaviors of task-specific expert models for understanding and generation, as well as current unified models. Our analysis reveals a crucial observation: understanding tasks benefit from progressively increasing modality alignment across network depth, which helps build up semantic information for better comprehension. In contrast, generation tasks follow a different trend: modality alignment increases in the early layers but decreases in the deep layers to recover spatial details. These divergent alignment patterns create a fundamental conflict in fully shared Transformer backbones, where a uniform representational flow often leads to performance compromises across the two tasks. Motivated by this finding, we introduce UniFork, a novel Y-shaped architecture that shares the shallow layers for cross-task representation learning, while employing task-specific branches in the deeper layers to avoid task interference. This design effectively balances shared learning and task specialization. Through extensive ablation experiments, we demonstrate that UniFork consistently outperforms conventional fully shared Transformer architectures, and achieves performance on par with or better than task-specific models.
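The routing described in the abstract (shared shallow layers, then a fork into task-specific deep branches) can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the "layers" are placeholder callables rather than Transformer blocks, and names like `num_shared` and `num_branch` are assumptions for illustration.

```python
from typing import Callable, List

Layer = Callable[[list], list]

def make_layer(tag: str) -> Layer:
    # A stand-in "layer" that records which block processed the tokens,
    # so the two forward paths can be compared.
    def layer(tokens: list) -> list:
        return tokens + [tag]
    return layer

class YShapedBackbone:
    """Shared shallow trunk feeding two task-specific deep branches."""

    def __init__(self, num_shared: int = 2, num_branch: int = 2):
        self.shared = [make_layer(f"shared{i}") for i in range(num_shared)]
        self.und_branch = [make_layer(f"und{i}") for i in range(num_branch)]
        self.gen_branch = [make_layer(f"gen{i}") for i in range(num_branch)]

    def forward(self, tokens: list, task: str) -> list:
        for layer in self.shared:  # shared shallow layers, both tasks
            tokens = layer(tokens)
        # fork: deep layers are task-specific, avoiding interference
        branch = self.und_branch if task == "understanding" else self.gen_branch
        for layer in branch:
            tokens = layer(tokens)
        return tokens

# Both tasks traverse the same shallow stack before diverging:
model = YShapedBackbone()
und_path = model.forward(["x"], "understanding")
gen_path = model.forward(["x"], "generation")
```

The design point the sketch makes concrete: weights (here, layer objects) are shared exactly where the two tasks' alignment trends agree, and duplicated where they diverge.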
Problem

Research questions and friction points this paper is trying to address.

Understanding how modality alignment evolves across network depth in unified multimodal models
Resolving the conflicting alignment patterns that understanding and generation impose on fully shared Transformer backbones
Designing an architecture that balances shared representation learning with task specialization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Y-shaped architecture with task-specific deep branches
Shared shallow layers for cross-task representation learning
Layer-wise modality alignment analysis of expert and unified models