Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE

📅 2026-05-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of unifying multimodal understanding and generation within a single architecture while keeping performance high and computational cost low. To this end, the authors propose Mamoda2.5, a unified AR-Diffusion framework whose Diffusion Transformer backbone uses a fine-grained Mixture-of-Experts (MoE) design (DiT-MoE) with 128 experts and Top-8 routing. The resulting model has 25B parameters but activates only 3B of them, which increases model capacity while reducing training cost. The study also introduces a joint few-step distillation and reinforcement learning strategy that compresses the 30-step video editing model into a 4-step model. Experiments show top-tier generation performance on VBench 2.0, video editing quality on OpenVE-Bench that surpasses the evaluated open-source models, up to 95.9× faster video editing inference than open-source baselines, and a 98% success rate in internal advertising video editing applications.

📝 Abstract

We present Mamoda2.5, a unified AR-Diffusion framework that seamlessly integrates multimodal understanding and generation within a single architecture. To efficiently enhance the model's generation capability, we equip the Diffusion Transformer backbone with a fine-grained Mixture-of-Experts (MoE) design (128 experts, Top-8 routing), yielding a 25B-parameter model that activates only 3B parameters, significantly reducing training costs while scaling up model capacity. Mamoda2.5 achieves top-tier generation performance on VBench 2.0 and sets a new record in video editing quality on OpenVE-Bench, surpassing the evaluated open-source models and matching current top-tier proprietary models, including Kling O1. Furthermore, we introduce a joint few-step distillation and reinforcement learning framework that compresses the 30-step editing model into a 4-step model and greatly accelerates inference. Compared to open-source baselines, Mamoda2.5 achieves up to $95.9\times$ faster video editing inference. In real-world applications, Mamoda2.5 has been successfully deployed for content moderation and creative restoration tasks in advertising, achieving a 98% success rate in internal advertising video editing scenarios.
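
To make the "128 experts, Top-8 routing" design concrete, here is a minimal NumPy sketch of generic top-k expert routing. It is an illustration, not the Mamoda2.5 implementation: the abstract does not describe the router, so the softmax renormalization over the selected experts, the single linear map per expert, and all names (`top_k_route`, `moe_layer`, `router_weight`, `expert_weights`) are assumptions made for this example.

```python
import numpy as np

NUM_EXPERTS = 128  # expert count stated in the abstract
TOP_K = 8          # experts activated per token, as stated in the abstract


def top_k_route(router_logits, k=TOP_K):
    """Pick the k highest-scoring experts per token.

    router_logits: array of shape (num_tokens, num_experts).
    Returns (indices, gates), both of shape (num_tokens, k).
    Assumption: gates are a softmax over the selected logits only.
    """
    # Indices of the k largest logits per token (order within the k is arbitrary).
    idx = np.argpartition(router_logits, -k, axis=-1)[:, -k:]
    top_logits = np.take_along_axis(router_logits, idx, axis=-1)
    # Numerically stable softmax restricted to the selected experts.
    top_logits = top_logits - top_logits.max(axis=-1, keepdims=True)
    gates = np.exp(top_logits)
    gates /= gates.sum(axis=-1, keepdims=True)
    return idx, gates


def moe_layer(tokens, router_weight, expert_weights, k=TOP_K):
    """Sparse MoE forward pass with one linear map per expert (hypothetical).

    tokens:         (num_tokens, d_model)
    router_weight:  (d_model, num_experts)
    expert_weights: (num_experts, d_model, d_model)
    """
    logits = tokens @ router_weight
    idx, gates = top_k_route(logits, k)
    out = np.zeros_like(tokens)
    # Only k of the experts run for each token; the rest stay inactive.
    for t in range(tokens.shape[0]):
        for slot in range(k):
            expert = idx[t, slot]
            out[t] += gates[t, slot] * (tokens[t] @ expert_weights[expert])
    return out, idx, gates


rng = np.random.default_rng(0)
d_model = 16
tokens = rng.normal(size=(4, d_model))
router_weight = rng.normal(size=(d_model, NUM_EXPERTS))
expert_weights = rng.normal(size=(NUM_EXPERTS, d_model, d_model))

out, idx, gates = moe_layer(tokens, router_weight, expert_weights)
print(out.shape, idx.shape, gates.sum(axis=-1))
```

In this sketch each token touches 8 of the 128 expert matrices, which is the mechanism that lets total parameter count grow much faster than per-token compute. The reported 25B total versus 3B active parameters cannot be derived from this ratio alone, since attention layers and other non-expert parameters are always active.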

Problem

Research questions and friction points this paper is trying to address.

multimodal generation

video editing

model efficiency

unified architecture

inference acceleration

Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts

Diffusion Transformer

AR-Diffusion