Understanding and Harnessing Sparsity in Unified Multimodal Models

📅 2025-12-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Unified multimodal models activate every component at inference time, which is inefficient, yet how sparsity differs across their modules has not been characterized systematically. Method: We use training-free pruning, covering both depth pruning and width reduction, to probe the understanding and generation components. The understanding component proves highly compressible, while the generation components are sensitive to compression. Building on this insight, we propose a Mixture-of-Experts (MoE) Adaptation inspired by the dynamic activation patterns observed across samples: the generation module is partitioned into experts that are sparsely activated. The approach is validated first with expert-frozen tuning and then with a fully trainable adaptation, which delivers additional gains. Contribution/Results: The adapted BAGEL model achieves performance comparable to the full model while activating only about half of its parameters.

📝 Abstract

Large multimodal models have achieved remarkable progress in both understanding and generation. Recent efforts pursue unified multimodal models that integrate heterogeneous components to support both capabilities within a single framework. However, such unification introduces inference inefficiencies, e.g., specific tasks or samples may not require the full knowledge or capacity of the unified model. Yet, a systematic understanding of how these inefficiencies manifest across different components remains limited. In this work, we first conduct a systematic analysis of unified multimodal model components using training-free pruning as a probing methodology, considering both depth pruning and width reduction. Our study reveals that the understanding component exhibits notable compressibility in both understanding and generation tasks, which is more pronounced in the latter. In contrast, the generation components are highly sensitive to compression, with performance deteriorating sharply even under moderate compression ratios. To address this limitation, we propose the Mixture-of-Experts (MoE) Adaptation, inspired by the dynamic activation patterns observed across different samples. This approach partitions the generation module into multiple experts and enables sparse activation to restore generation quality. We validate the effectiveness of sparse activation through expert-frozen tuning and further demonstrate that a fully trainable adaptation delivers additional gains. As a result, the adapted BAGEL model achieves performance comparable to the full model while activating only about half of its parameters. The code is released at https://github.com/Shwai-He/SparseUnifiedModel.
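To make the two probing axes concrete, here is a minimal sketch of training-free depth pruning and width reduction on a toy stack of residual feed-forward blocks. It is an illustration under stated assumptions, not the paper's procedure: the importance criteria (how much a block changes its input, and summed hidden activation on a calibration set) and all function names are hypothetical choices made for this example.

```python
import math
import random


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def layer_forward(layer, x):
    """Toy residual block: x + W_out @ relu(W_in @ x)."""
    hidden = [max(0.0, h) for h in matvec(layer["w_in"], x)]
    return [xi + oi for xi, oi in zip(x, matvec(layer["w_out"], hidden))]


def forward(layers, x):
    for layer in layers:
        x = layer_forward(layer, x)
    return x


def cosine(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    norm = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    return dot / (norm + 1e-12)


def depth_prune(layers, calib, keep):
    """Keep the `keep` blocks that change their input most (assumed criterion)."""
    scores = [0.0] * len(layers)
    for x in calib:
        for i, layer in enumerate(layers):
            y = layer_forward(layer, x)
            scores[i] += 1.0 - cosine(x, y)
            x = y
    ranked = sorted(range(len(layers)), key=lambda i: scores[i], reverse=True)
    return [layers[i] for i in sorted(ranked[:keep])]  # preserve original order


def width_prune(layer, calib, keep):
    """Keep the `keep` hidden units with the largest summed activation.

    Assumed criterion; for simplicity the calibration vectors are fed to the
    block directly instead of being propagated through earlier blocks.
    """
    scores = [0.0] * len(layer["w_in"])
    for x in calib:
        for j, h in enumerate(matvec(layer["w_in"], x)):
            scores[j] += max(0.0, h)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    kept = sorted(ranked[:keep])
    return {
        "w_in": [layer["w_in"][j] for j in kept],
        "w_out": [[row[j] for j in kept] for row in layer["w_out"]],
    }


def make_layer(rng, dim, hidden):
    """Random toy block with `dim` features and `hidden` hidden units."""
    return {
        "w_in": [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(hidden)],
        "w_out": [[rng.gauss(0, 0.1) for _ in range(hidden)] for _ in range(dim)],
    }


rng = random.Random(0)
layers = [make_layer(rng, 4, 8) for _ in range(6)]
calib = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(16)]

shallow = depth_prune(layers, calib, keep=3)    # depth pruning: 6 -> 3 blocks
narrow = width_prune(layers[0], calib, keep=4)  # width reduction: 8 -> 4 units
print(len(shallow), len(narrow["w_in"]))
```

No weights are updated in either function, which is what makes the probe training-free. Comparing task quality before and after each kind of pruning, component by component, is the sort of measurement the abstract describes.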

Problem

Research questions and friction points this paper is trying to address.

Unified multimodal models run the full model for every input, even when a task or sample does not need its full capacity, which makes inference inefficient

Generation components are highly sensitive to compression, causing sharp performance deterioration

Current approaches lack systematic understanding of inefficiency distribution across model components

Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematic analysis using training-free pruning for model components

Mixture-of-Experts adaptation enabling sparse activation in generation

Achieves performance comparable to the full model with only about half of the parameters activated
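The idea of partitioning a module into experts and activating only some of them can be sketched as follows. This is a toy illustration only: the contiguous partitioning of hidden units, the linear top-k router, and every name in the code are assumptions made for the example, and the summary does not specify how the paper routes or weights experts.

```python
import random


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def ffn(w_in, w_out, x):
    """Feed-forward block: W_out @ relu(W_in @ x)."""
    hidden = [max(0.0, h) for h in matvec(w_in, x)]
    return matvec(w_out, hidden)


def split_into_experts(w_in, w_out, num_experts):
    """Partition hidden units into equal contiguous experts (assumed scheme)."""
    size, rem = divmod(len(w_in), num_experts)
    assert rem == 0, "hidden size must be divisible by num_experts"
    experts = []
    for e in range(num_experts):
        idx = range(e * size, (e + 1) * size)
        experts.append({
            "w_in": [w_in[j] for j in idx],
            "w_out": [[row[j] for j in idx] for row in w_out],
        })
    return experts


def moe_ffn(experts, router, x, top_k):
    """Run only the top_k experts picked by a linear router (hypothetical)."""
    logits = matvec(router, x)
    ranked = sorted(range(len(experts)), key=lambda e: logits[e], reverse=True)
    active = sorted(ranked[:top_k])
    out = [0.0] * len(experts[0]["w_out"])
    for e in active:
        part = ffn(experts[e]["w_in"], experts[e]["w_out"], x)
        out = [o + p for o, p in zip(out, part)]
    return out, active


def num_params(expert):
    """Count the weights stored in one expert."""
    return sum(len(r) for r in expert["w_in"]) + sum(len(r) for r in expert["w_out"])


rng = random.Random(0)
dim, hidden, num_experts, top_k = 4, 8, 4, 2
w_in = [[rng.gauss(0, 0.5) for _ in range(dim)] for _ in range(hidden)]
w_out = [[rng.gauss(0, 0.5) for _ in range(hidden)] for _ in range(dim)]
router = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_experts)]
x = [rng.gauss(0, 1) for _ in range(dim)]

experts = split_into_experts(w_in, w_out, num_experts)
sparse_out, active = moe_ffn(experts, router, x, top_k)
active_fraction = (
    sum(num_params(experts[e]) for e in active)
    / sum(num_params(e) for e in experts)
)
print(active, active_fraction)
```

With 2 of 4 equal experts selected, half of the feed-forward weights are used per input (the small router is not counted). Because the block is a sum over hidden units, activating all experts reproduces the dense output, so an existing dense module can be converted to this form without changing its function; sparse activation then trades some of that capacity for lower compute, and tuning is what the abstract credits with restoring quality.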

🔎 Similar Papers

Sparsely Multimodal Data Fusion