Resolving Task Objective Conflicts in Unified Multimodal Understanding and Generation via Task-Aware Mixture-of-Experts

📅 2025-06-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Unified multimodal large language models face an inherent objective conflict between understanding tasks—emphasizing high-level semantic abstraction—and generation tasks—prioritizing fine-grained fidelity. Method: We propose UTAMoE, a task-aware Mixture-of-Experts (MoE) framework that introduces the first task-aware MoE layer *within* an autoregressive Transformer, enabling structural decoupling of understanding and generation pathways. It employs a two-stage collaborative training strategy to balance task specificity and model-wide consistency, incorporating module-level decoupling, multimodal joint fine-tuning, and visual attribution analysis. Contribution/Results: UTAMoE achieves state-of-the-art performance on major multimodal understanding and generation benchmarks, significantly mitigating cross-task interference. Ablation studies and visualization-based attribution analysis empirically validate both the effectiveness and interpretability of the pathway separation.
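The summary does not include reference code, but the core idea of the Task-Aware MoE Layer can be illustrated with a minimal sketch: each task (understanding vs. generation) is routed through its own expert subset plus a shared pool, creating task-specific optimization subpaths inside the layer. All class names, the routing scheme, and the expert counts below are assumptions for illustration, not the paper's actual implementation.

```python
import math
import random

random.seed(0)

UNDERSTANDING, GENERATION = 0, 1  # hypothetical task identifiers

class Expert:
    """A tiny feed-forward 'expert': y = W2 @ relu(W1 @ x), pure-Python lists."""
    def __init__(self, dim, hidden):
        self.w1 = [[random.gauss(0, 0.1) for _ in range(dim)] for _ in range(hidden)]
        self.w2 = [[random.gauss(0, 0.1) for _ in range(hidden)] for _ in range(dim)]

    def __call__(self, x):
        h = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in self.w1]
        return [sum(w * hi for w, hi in zip(row, h)) for row in self.w2]

def softmax(zs):
    m = max(zs)
    es = [math.exp(z - m) for z in zs]
    s = sum(es)
    return [e / s for e in es]

class TaskAwareMoELayer:
    """Sketch of a task-aware MoE layer: each task owns a disjoint expert
    subset plus a shared pool; a per-task router mixes within that subset,
    so understanding and generation follow separate optimization subpaths."""
    def __init__(self, dim, hidden, n_task_experts=2, n_shared=1):
        self.task_experts = {
            UNDERSTANDING: [Expert(dim, hidden) for _ in range(n_task_experts)],
            GENERATION: [Expert(dim, hidden) for _ in range(n_task_experts)],
        }
        self.shared = [Expert(dim, hidden) for _ in range(n_shared)]
        n_routed = n_task_experts + n_shared
        # one router weight matrix per task, over its allowed expert pool
        self.router = {
            t: [[random.gauss(0, 0.1) for _ in range(dim)] for _ in range(n_routed)]
            for t in (UNDERSTANDING, GENERATION)
        }

    def __call__(self, x, task):
        pool = self.task_experts[task] + self.shared  # task-restricted pool
        logits = [sum(w * xi for w, xi in zip(row, x)) for row in self.router[task]]
        gates = softmax(logits)
        outs = [e(x) for e in pool]
        # gate-weighted mixture of expert outputs
        return [sum(g * o[i] for g, o in zip(gates, outs)) for i in range(len(x))]

layer = TaskAwareMoELayer(dim=4, hidden=8)
x = [0.5, -0.2, 0.1, 0.9]
y_und = layer(x, UNDERSTANDING)  # routed through understanding + shared experts
y_gen = layer(x, GENERATION)     # routed through generation + shared experts
```

The same token passes through different expert pools depending on the task token's identity, which is the structural decoupling the summary describes; the shared experts are one plausible way to preserve the model-wide consistency that the two-stage training strategy targets.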

📝 Abstract
Unified multimodal large language models (MLLMs) based on end-to-end autoregressive (AR) transformers effectively integrate both understanding and generation tasks within a single framework. However, intrinsic Task Objective Conflicts between high-level semantic abstraction in understanding and fine-grained detail preservation in generation pose significant challenges, often leading to suboptimal trade-offs and task interference. Existing solutions, such as decoupling shared visual encoders, fall short of fundamentally resolving these conflicts due to the inherent AR architecture. In this paper, we propose a novel approach that decouples internal components of the AR transformer to resolve task objective conflicts. Specifically, we design UTAMoE, a Unified Task-Aware Mixture-of-Experts (MoE) framework that decouples internal AR modules via a Task-Aware MoE Layer to create task-specific optimization subpaths. To enhance task differentiation while maintaining overall coordination, we introduce a novel Two-Stage Training Strategy. Extensive experiments on multimodal benchmarks demonstrate that UTAMoE mitigates task objective conflicts, achieving state-of-the-art performance across various tasks. Visualizations and ablation studies further validate the effectiveness of our approach.
Problem

Research questions and friction points this paper is trying to address.

Resolving task objective conflicts in multimodal models
Decoupling autoregressive modules for task-specific optimization
Improving performance in understanding and generation tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Task-Aware Mixture-of-Experts (MoE) framework
Decouples AR modules for task-specific paths
Two-Stage Training Strategy for coordination