🤖 AI Summary
To address rigid cross-modal interaction and high computational overhead in multimodal diffusion models, this paper proposes MoS (Mixture of States), a novel state-fusion-based multimodal diffusion paradigm. Its core innovation is a timestep- and input-dependent token-level sparse router that dynamically selects and fuses hidden states from the text and image modalities, enabling fine-grained feature and trajectory alignment during denoising. Leveraging top-k selection and an ε-greedy training strategy, MoS incurs negligible additional parameters and computational cost. On text-to-image generation and editing tasks, MoS achieves state-of-the-art performance with only 3B–5B parameters, significantly outperforming baseline models with up to four times the parameter count. This demonstrates MoS's efficiency, scalability, and strong generalization across diverse multimodal diffusion scenarios.
📝 Abstract
We introduce MoS (Mixture of States), a novel fusion paradigm for multimodal diffusion models that merges modalities using flexible, state-based interactions. The core of MoS is a learnable, token-wise router that creates denoising timestep- and input-dependent interactions between modalities' hidden states, precisely aligning token-level features with the diffusion trajectory. This router sparsely selects the top-$k$ hidden states and is trained with an $\epsilon$-greedy strategy, efficiently selecting contextual features with minimal learnable parameters and negligible computational overhead. We validate our design with text-to-image generation (MoS-Image) and editing (MoS-Editing), which achieve state-of-the-art results. With only 3B to 5B parameters, our models match or surpass counterparts up to $4\times$ larger. These findings establish MoS as a flexible and compute-efficient paradigm for scaling multimodal diffusion models.
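To make the routing mechanism concrete, here is a minimal NumPy sketch of token-wise top-$k$ selection with ε-greedy exploration during training. This is an illustrative reconstruction based only on the abstract, not the paper's actual implementation; the function names, the use of softmax mixture weights, and the shapes (tokens × candidate hidden states) are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def topk_route(scores, k):
    """Keep the k largest routing scores per token; mask out the rest."""
    idx = np.argsort(scores, axis=-1)[..., -k:]   # indices of the top-k states
    mask = np.zeros_like(scores, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=-1)
    return np.where(mask, scores, -np.inf)        # -inf -> zero weight after softmax

def epsilon_greedy_topk(scores, k, eps):
    """With probability eps pick k candidate states uniformly at random
    (exploration); otherwise use the learned top-k scores (exploitation).
    Hypothetical training-time behavior inferred from the abstract."""
    if rng.random() < eps:
        idx = rng.choice(scores.shape[-1], size=k, replace=False)
        mask = np.zeros_like(scores, dtype=bool)
        mask[..., idx] = True
        return np.where(mask, scores, -np.inf)
    return topk_route(scores, k)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Toy example: 4 tokens routing over 6 candidate hidden states, k = 2.
scores = rng.normal(size=(4, 6))                  # router logits per token
weights = softmax(topk_route(scores, k=2))        # sparse mixture weights
assert np.allclose(weights.sum(axis=-1), 1.0)     # weights normalize per token
assert (weights > 0).sum(axis=-1).max() == 2      # exactly k states selected
```

The selected states would then be fused as a weighted sum per token; because only $k$ of the candidates receive nonzero weight, the extra compute and parameter cost of the router stays negligible, consistent with the abstract's claims.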