🤖 AI Summary
This work addresses the issue of expert homogenization in conventional Mixture-of-Experts (MoE) models for video understanding, which hinders effective spatiotemporal feature modeling. To overcome this limitation, the authors propose a functionally specialized heterogeneous MoE architecture that leverages content-aware multi-rate video sampling and a bidirectional dynamic feature fusion mechanism. This design encourages expert specialization and enables efficient transfer learning from image to video domains. The proposed method achieves state-of-the-art performance across multiple video recognition benchmarks, significantly enhancing both model representational capacity and expert diversity.
📝 Abstract
With the rapid development of pre-training technologies, adapting large-scale Vision-Language Models (VLMs) for video understanding, i.e., image-to-video transfer learning, has become a dominant paradigm. Among recent advances, employing Mixture-of-Experts (MoE) to enhance VLMs' temporal modeling capabilities has emerged as an effective strategy for improving performance. However, conventional MoE designs suffer from expert homogenization, where all experts act as identical generalists, inefficiently learning spatio-temporal features from undifferentiated video streams. To overcome this problem, we propose VidPrism, a novel heterogeneous temporal Mixture-of-Experts framework. VidPrism pioneers a division of labor by deploying functionally specialized experts, each assuming a role ranging from spatial understanding to temporal modeling. To feed these specialists appropriately, we introduce a content-aware, multi-rate sampling module that dynamically generates streams ranging from semantically rich to motion-focused representations, providing specialized inputs for experts. Furthermore, a dynamic, bidirectional fusion mechanism enables synergistic information exchange between these pathways, leading to a comprehensive video representation. Extensive experiments on various video recognition benchmarks demonstrate that VidPrism achieves state-of-the-art performance and effectively fosters expert specialization. Our source code is available at https://github.com/Lrrrr549/VidPrism.git.
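To make the idea of content-aware, multi-rate sampling more concrete, the following is a minimal illustrative sketch and not VidPrism's actual module, which the abstract does not specify. It assumes a simple motion score (mean absolute difference between consecutive frames) and a hypothetical blending weight `alpha` that moves a stream from uniform sampling (`alpha=0`) to motion-weighted sampling (`alpha=1`). All function names, the `rates` defaults, and the scoring rule are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def motion_scores(frames: Sequence[Sequence[float]]) -> List[float]:
    """Mean absolute difference between each frame and its predecessor.

    The first frame has no predecessor, so its score is 0.0.
    (Assumed scoring rule, chosen only for illustration.)
    """
    scores = [0.0]
    for prev, cur in zip(frames, frames[1:]):
        scores.append(sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur))
    return scores


def content_aware_indices(scores: Sequence[float], k: int, alpha: float) -> List[int]:
    """Pick k frame indices by deterministic inverse-CDF sampling.

    alpha=0.0 gives uniform spacing; alpha=1.0 weights frames purely by
    motion score. Indices may repeat when the weight mass is concentrated.
    """
    n = len(scores)
    total = sum(scores)
    weights = []
    for s in scores:
        # Fall back to uniform weights when the clip is completely static.
        motion = s / total if total > 0 else 1.0 / n
        weights.append((1.0 - alpha) / n + alpha * motion)

    # Cumulative distribution over frames.
    cdf, acc = [], 0.0
    for w in weights:
        acc += w
        cdf.append(acc)

    # Walk the CDF once, taking the frame at the midpoint of each of k bins.
    indices, j = [], 0
    for i in range(k):
        target = (i + 0.5) / k * acc
        while j < n - 1 and cdf[j] < target:
            j += 1
        indices.append(j)
    return indices


def multi_rate_streams(
    frames: Sequence[Sequence[float]],
    rates: Sequence[Tuple[int, float]] = ((4, 0.0), (8, 0.5), (16, 1.0)),
) -> List[List[int]]:
    """Return one list of frame indices per (num_frames, alpha) pair."""
    scores = motion_scores(frames)
    return [content_aware_indices(scores, k, a) for k, a in rates]
```

Under these assumptions, a sparse stream with `alpha=0.0` would feed a spatially oriented expert a few evenly spaced frames, while a dense stream with `alpha=1.0` would concentrate frames where the content changes most, which is one plausible way to give each expert a differentiated input.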