Architecture Decoupling Is Not All You Need For Unified Multimodal Model

📅 2025-11-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses performance degradation in unified multimodal models caused by objective conflicts between image generation and understanding tasks. We propose a model-agnostic co-optimization framework that avoids architectural decoupling—thereby preserving the model’s capacity for interleaved text-image generation. Our core innovation is the Attention Interaction Alignment (AIA) loss, which explicitly models task-specific cross-modal interaction patterns by quantitatively analyzing cross-attention behavior, and constrains cross-modal attention distributions during both supervised fine-tuning and post-training phases. Evaluated on state-of-the-art unified architectures—including Emu3 and Janus-Pro—our method achieves consistent improvements: +2.1% average accuracy on image understanding benchmarks and −8.3% reduction in FID score for image generation quality. Crucially, it introduces no additional parameters or inference overhead.

Technology Category

Application Category

📝 Abstract
Unified multimodal models for image generation and understanding represent a significant step toward AGI and have attracted widespread attention from researchers. The main challenge of this task lies in the difficulty in establishing an optimal training paradigm due to inherent conflicting targets in understanding and generation tasks. To alleviate these conflicts and pursue higher performance, many researchers adopt varying degrees of model decoupling (e.g., Double image encoders, MOE/MOT architecture, or frozen MLLM). However, excessive model decoupling can lead to the loss of interleave generation ability, undermining the original intent of unified models. In this work, we aim to explore how to mitigate task conflicts without resorting to model decoupling. Firstly, we analyze why decoupling alleviates conflicts by studying the cross-modal attention behavior of models. We observe that model decoupling essentially drives models toward task-specific multimodal interaction patterns, as seen in Qwen-VL and HunyuanImage, and that the more thorough the decoupling, the more consistent the behavior becomes. Motivated by this observation, we propose Attention Interaction Alignment (AIA) loss, which explicitly learns Task-Specific multimodal interaction patterns during training. To demonstrate the generalizability of our AIA loss, we apply it to Emu3 and Janus-Pro during SFT and post-training stage respectively. Without bells and whistles, AIA not only refines cross-modal attention patterns, but also boosts both generation and understanding performance.
Problem

Research questions and friction points this paper is trying to address.

Mitigate task conflicts in unified multimodal models
Avoid excessive model decoupling to preserve interleave generation
Enhance both generation and understanding performance via attention alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Attention Interaction Alignment loss mitigates task conflicts
Learns task-specific multimodal interaction patterns explicitly
Applied to models like Emu3 and Janus-Pro
🔎 Similar Papers
No similar papers found.