VidPrism: Heterogeneous Mixture of Experts for Image-to-Video Transfer

📅 2026-05-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the issue of expert homogenization in conventional Mixture-of-Experts (MoE) models for video understanding, which hinders effective spatiotemporal feature modeling. To overcome this limitation, the authors propose a functionally specialized heterogeneous MoE architecture that leverages content-aware multi-rate video sampling and a bidirectional dynamic feature fusion mechanism. This design encourages expert specialization and enables efficient transfer learning from image to video domains. The proposed method achieves state-of-the-art performance across multiple video recognition benchmarks, significantly enhancing both model representational capacity and expert diversity.
📝 Abstract
With the rapid development of pre-training technologies, adapting large-scale Vision-Language Models (VLMs) for video understanding, i.e., image-to-video transfer learning, has become a dominant paradigm. Among recent advances, employing Mixture-of-Experts (MoE) to enhance VLMs' temporal modeling capabilities has emerged as an effective strategy for improving performance. However, conventional MoE designs suffer from expert homogenization, where all experts act as identical generalists, inefficiently learning spatio-temporal features from undifferentiated video streams. To overcome this problem, we propose VidPrism, a novel heterogeneous temporal Mixture-of-Experts framework. VidPrism pioneers a division of labor by deploying functionally specialized experts, each assuming a role ranging from spatial understanding to temporal modeling. To feed these specialists appropriately, we introduce a content-aware, multi-rate sampling module that dynamically generates streams ranging from semantically rich to motion-focused representations, providing specialized inputs for experts. Furthermore, a dynamic, bidirectional fusion mechanism enables synergistic information exchange between these pathways, leading to a comprehensive video representation. Extensive experiments on various video recognition benchmarks demonstrate that VidPrism achieves state-of-the-art performance and effectively fosters expert specialization. Our source code is available at https://github.com/Lrrrr549/VidPrism.git.
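To make the idea of content-aware, multi-rate sampling more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not say how the streams are built, so the frame-difference motion score, the uniform low-rate stream, and every name here (`motion_scores`, `uniform_indices`, `motion_indices`, `multi_rate_sample`, the `"semantic"` and `"motion"` keys) are assumptions made for illustration only.

```python
from typing import Dict, List, Sequence

# Hypothetical stand-in for real pixel data: one flattened frame of floats.
Frame = Sequence[float]


def motion_scores(frames: List[Frame]) -> List[float]:
    """Mean absolute difference between each frame and its predecessor.

    The first frame has no predecessor, so its score is 0.0.
    This is an assumed proxy for "content awareness", not the paper's method.
    """
    if not frames:
        return []
    scores = [0.0]
    for prev, cur in zip(frames, frames[1:]):
        diff = sum(abs(a - b) for a, b in zip(prev, cur))
        scores.append(diff / max(len(cur), 1))
    return scores


def uniform_indices(num_frames: int, k: int) -> List[int]:
    """Return k evenly spaced frame indices (a sparse, low-rate stream)."""
    k = min(k, num_frames)
    return [(i * num_frames) // k for i in range(k)]


def motion_indices(scores: List[float], k: int) -> List[int]:
    """Return the indices of the k highest-motion frames in temporal order."""
    k = min(k, len(scores))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


def multi_rate_sample(frames: List[Frame], semantic_k: int, motion_k: int) -> Dict[str, List[int]]:
    """Build two index streams: a sparse uniform one and a motion-focused one."""
    scores = motion_scores(frames)
    return {
        "semantic": uniform_indices(len(frames), semantic_k),
        "motion": motion_indices(scores, motion_k),
    }
```

In this sketch each stream is just a list of frame indices; a heterogeneous MoE of the kind the abstract describes would then route each stream to a different specialized expert. How VidPrism actually scores content and chooses sampling rates should be checked against the paper and the linked repository.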
Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts
image-to-video transfer
expert homogenization
video understanding
spatio-temporal features
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts
heterogeneous experts
image-to-video transfer
temporal modeling
content-aware sampling