🤖 AI Summary
To address rigid cross-modal interaction and high computational overhead in multimodal diffusion models, this paper proposes MoS (Mixture of States), a novel state-fusion-based multimodal diffusion paradigm. Its core innovation is a timestep- and input-dependent token-level sparse router that dynamically selects and fuses hidden states from the text and image modalities, enabling fine-grained feature and trajectory alignment during denoising. Leveraging top-k selection and an ε-greedy training strategy, MoS incurs negligible additional parameters and computational cost. On text-to-image generation and editing tasks, MoS achieves state-of-the-art performance with only 3B–5B parameters, significantly outperforming baseline models with up to four times the parameter count. This demonstrates MoS's efficiency, scalability, and strong generalization across diverse multimodal diffusion scenarios.
📝 Abstract
We introduce MoS (Mixture of States), a novel fusion paradigm for multimodal diffusion models that merges modalities using flexible, state-based interactions. The core of MoS is a learnable, token-wise router that creates denoising timestep- and input-dependent interactions between modalities' hidden states, precisely aligning token-level features with the diffusion trajectory. This router sparsely selects the top-$k$ hidden states and is trained with an $\epsilon$-greedy strategy, efficiently selecting contextual features with minimal learnable parameters and negligible computational overhead. We validate our design with text-to-image generation (MoS-Image) and editing (MoS-Editing), which achieve state-of-the-art results. With only 3B to 5B parameters, our models match or surpass counterparts up to $4\times$ larger. These findings establish MoS as a flexible and compute-efficient paradigm for scaling multimodal diffusion models.
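To make the routing mechanism concrete, here is a minimal NumPy sketch of token-wise top-$k$ selection with ε-greedy exploration during training. This is an illustrative reconstruction based only on the abstract, not the paper's actual implementation; the function names, the use of softmax mixture weights, and the shapes (tokens × candidate hidden states) are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def topk_route(scores, k):
    """Keep the k largest routing scores per token; mask out the rest."""
    idx = np.argsort(scores, axis=-1)[..., -k:]   # indices of the top-k states
    mask = np.zeros_like(scores, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=-1)
    return np.where(mask, scores, -np.inf)        # -inf -> zero weight after softmax

def epsilon_greedy_topk(scores, k, eps):
    """With probability eps pick k candidate states uniformly at random
    (exploration); otherwise use the learned top-k scores (exploitation).
    Hypothetical training-time behavior inferred from the abstract."""
    if rng.random() < eps:
        idx = rng.choice(scores.shape[-1], size=k, replace=False)
        mask = np.zeros_like(scores, dtype=bool)
        mask[..., idx] = True
        return np.where(mask, scores, -np.inf)
    return topk_route(scores, k)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Toy example: 4 tokens routing over 6 candidate hidden states, k = 2.
scores = rng.normal(size=(4, 6))                  # router logits per token
weights = softmax(topk_route(scores, k=2))        # sparse mixture weights
assert np.allclose(weights.sum(axis=-1), 1.0)     # weights normalize per token
assert (weights > 0).sum(axis=-1).max() == 2      # exactly k states selected
```

The selected states would then be fused as a weighted sum per token; because only $k$ of the candidates receive nonzero weight, the extra compute and parameter cost of the router stays negligible, consistent with the abstract's claims.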