AI Summary
This work addresses the catastrophic forgetting of language understanding that often arises when large multimodal language models are endowed with image generation capabilities, primarily due to gradient conflicts between generative and discriminative objectives. To mitigate this, the authors propose a native multimodal mixture-of-experts (MoE) architecture that jointly optimizes generation and understanding within a unified pretraining framework. The approach combines modality-aware expert decoupling, shared experts acting as cross-modal semantic bridges, differential learning rates, and early-stage gradient masking, all without introducing additional parameters. Experiments show that the method significantly improves performance on language understanding benchmarks such as MMLU and OCRBench while simultaneously accelerating convergence on image generation tasks.
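The modality-aware expert decoupling mentioned above can be sketched as a constrained router: each token may only route to the shared experts plus the expert group matching its task. This is a minimal illustrative sketch; the expert counts, group sizes, and function names are assumptions, not the paper's actual configuration.

```python
import numpy as np

# Hypothetical expert layout (illustrative, not from the paper):
# two shared experts act as the cross-modal semantic bridge, and the
# remaining experts are split into understanding- and generation-specific groups.
NUM_EXPERTS = 8
SHARED = [0, 1]          # shared experts: cross-modal semantic bridge
UND_GROUP = [2, 3, 4]    # experts reserved for understanding tokens
GEN_GROUP = [5, 6, 7]    # experts reserved for generation tokens

def masked_router_probs(logits, modality):
    """Mask out experts outside the token's allowed group, then softmax.

    `modality` is "understanding" or "generation"; disallowed experts
    receive -inf logits and therefore zero routing probability.
    """
    allowed = SHARED + (UND_GROUP if modality == "understanding" else GEN_GROUP)
    mask = np.full_like(logits, -np.inf)
    mask[allowed] = 0.0
    masked = logits + mask
    exp = np.exp(masked - masked[allowed].max())  # stable softmax
    return exp / exp.sum()

# With uniform logits, probability mass spreads only over allowed experts.
probs = masked_router_probs(np.zeros(NUM_EXPERTS), "understanding")
```

Because generation tokens can never claim understanding experts, this kind of hard partition is one way to prevent the routing collapse in which generative gradients dominate expert utilization.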
Abstract
Empowering Large Multimodal Models (LMMs) with image generation often leads to catastrophic forgetting in understanding tasks due to severe gradient conflicts. While existing paradigms like Mixture-of-Transformers (MoT) mitigate this conflict through structural isolation, they fundamentally sever cross-modal synergy and suffer from capacity fragmentation. In this work, we present Symbiotic-MoE, a unified pre-training framework that resolves task interference within a native multimodal Mixture-of-Experts (MoE) Transformer architecture with zero parameter overhead. We first identify that standard MoE tuning leads to routing collapse, where generative gradients dominate expert utilization. To address this, we introduce Modality-Aware Expert Disentanglement, which partitions experts into task-specific groups while utilizing shared experts as a multimodal semantic bridge. Crucially, this design allows shared experts to absorb fine-grained visual semantics from generative tasks to enrich textual representations. To optimize this, we propose a Progressive Training Strategy featuring differential learning rates and early-stage gradient shielding. This mechanism not only shields pre-trained knowledge from early volatility but also eventually transforms generative signals into constructive feedback for understanding. Extensive experiments demonstrate that Symbiotic-MoE achieves rapid generative convergence while unlocking cross-modal synergy, boosting inherent understanding with remarkable gains on MMLU and OCRBench.
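The Progressive Training Strategy pairs differential learning rates with early-stage gradient shielding. A minimal sketch of that schedule, with invented step counts and rates (the paper does not publish these values here), might look like:

```python
# Illustrative schedule (all constants are assumptions for the sketch):
# newly specialized generation experts train at a higher learning rate,
# while gradients into the pretrained understanding/shared parameters are
# scaled to zero during an early shielding window, then updated normally.
BASE_LR = 1e-5        # pretrained understanding/shared parameters
GEN_LR = 1e-4         # generation-specific experts
SHIELD_STEPS = 1000   # early-stage gradient shielding window

def lr_and_grad_scale(step, param_group):
    """Return (learning rate, gradient scale) for a parameter group.

    `param_group` is "generation" or "pretrained" (hypothetical labels).
    """
    if param_group == "generation":
        return GEN_LR, 1.0
    # Pretrained parameters: fully shielded early, then regular updates,
    # letting generative signals become constructive feedback later on.
    scale = 0.0 if step < SHIELD_STEPS else 1.0
    return BASE_LR, scale
```

In a real training loop, the returned gradient scale would multiply the gradients of the corresponding parameter group before the optimizer step, e.g. via per-parameter-group hooks.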