🤖 AI Summary
Existing multimodal world models struggle to effectively leverage the rich prior knowledge embedded in foundation models of individual modalities. To address this limitation, this work proposes M²-REPA, which the authors describe as the first representation alignment method tailored for multimodal video generation. The method decouples modality-specific features from the intermediate representations of a diffusion model and aligns each with its corresponding foundation model. It combines a multimodal representation alignment loss with a modality-specific decoupling regularization so that the representations are optimized jointly. According to the reported experiments, this significantly improves both the visual quality and the long-term consistency of generated videos relative to the baselines.
📝 Abstract
Emerging multi-modal world models attempt to jointly generate videos across diverse modalities (e.g., RGB, depth, and mask), yet they fail to fully exploit the rich priors of existing foundation models. We propose $M^2$-REPA, the first representation alignment method tailored for multi-modal video generation. Our key insight is that foundation models trained on different modality spaces naturally capture distinct domain-specific priors, acting as complementary "experts." Specifically, we first decouple modality-specific features from the diffusion model's intermediate representations, then align each with its corresponding expert foundation model. To this end, we design two synergistic objectives: a multi-modal representation alignment loss that enforces feature-to-expert matching, and a modality-specific decoupling regularization that encourages complementarity across different modalities. This design enables joint optimization, fully exploiting priors from multiple foundation models. Extensive experiments demonstrate that our method significantly outperforms baselines in visual quality and long-term consistency.
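The two objectives described above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the loss formulas, so the code assumes a REPA-style negative cosine similarity for alignment and a squared cross-modality cosine penalty for decoupling. The channel-slicing `split_by_modality`, the function names, and the weights `lam_align` and `lam_dec` are all hypothetical, and expert features are assumed to already share the dimension of each modality chunk (a real system would likely use learned projection heads).

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v, eps=1e-8):
    # Cosine similarity with a small epsilon to avoid division by zero.
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)) + eps)


def split_by_modality(hidden, modalities):
    # ASSUMPTION: decoupling is modeled as slicing each token's hidden
    # vector into equal, contiguous per-modality chunks.
    chunk = len(hidden[0]) // len(modalities)
    return {
        m: [tok[i * chunk:(i + 1) * chunk] for tok in hidden]
        for i, m in enumerate(modalities)
    }


def alignment_loss(features, expert_features):
    # ASSUMPTION: mean negative cosine similarity between each
    # modality-specific token and the matching expert token.
    total, count = 0.0, 0
    for m, toks in features.items():
        for f, e in zip(toks, expert_features[m]):
            total -= cosine(f, e)
            count += 1
    return total / count


def decoupling_reg(features):
    # ASSUMPTION: mean squared cosine similarity between features of
    # different modalities at the same token; zero when they are orthogonal.
    names = list(features)
    total, count = 0.0, 0
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            for f, g in zip(features[names[i]], features[names[j]]):
                total += cosine(f, g) ** 2
                count += 1
    return total / count


def combined_loss(hidden, expert_features, modalities,
                  lam_align=1.0, lam_dec=0.1):
    # Hypothetical weighted sum of the two objectives.
    feats = split_by_modality(hidden, modalities)
    return (lam_align * alignment_loss(feats, expert_features)
            + lam_dec * decoupling_reg(feats))


# Toy usage: one token, three modalities, two channels per modality.
modalities = ["rgb", "depth", "mask"]
hidden = [[1.0, 0.0, 0.0, 1.0, 1.0, 1.0]]
experts = split_by_modality(hidden, modalities)  # perfectly aligned experts
loss = combined_loss(hidden, experts, modalities)
```

In this toy setup the alignment term reaches its minimum because the expert features equal the decoupled features, while the decoupling term stays positive because the hypothetical `mask` chunk overlaps with the other two. In practice these terms would be added to the diffusion training loss.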