🤖 AI Summary
Current foundation models struggle to identify contextually critical sub-events in videos, such as pivotal moments in a football match, which limits their capacity for multimodal event narration and summarization. This work addresses the gap by constructing, for the first time, a human preference dataset derived from readily available highlight videos, requiring no additional annotation, and uses it to evaluate how well mainstream multimodal models distinguish important from non-important sub-events. Experiments show that existing models perform close to random chance, and analysis of their cross-modal behavior attributes this primarily to over-reliance on a single dominant modality and insufficient cross-modal coordination. The findings expose fundamental challenges in real-world video understanding and underscore the need for modular architectures and complementary training strategies to advance multimodal event comprehension.
📝 Abstract
Foundation models are used for many real-world applications involving language generation from temporally ordered multimodal events. In this work, we study the ability of models to identify the most important sub-events in a video, which is a fundamental prerequisite for narrating or summarizing multimodal events. Specifically, we focus on football games and evaluate models on their ability to distinguish between important and non-important sub-events in a game. To this end, we construct a new dataset by leveraging human preferences for importance implicit in football game highlight reels, without any additional annotation cost. Using our dataset, which we will publicly release to the community, we compare several state-of-the-art multimodal models and show that they are not far from chance-level performance. Analyses of models beyond standard evaluation metrics reveal their tendency to rely on a single dominant modality and their ineffectiveness in synthesizing necessary information from multiple sources. Our findings underline the importance of modular architectures that can handle sample-level heterogeneity in multimodal data and the need for complementary training procedures that can maximize cross-modal synergy.
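To make the annotation-free labeling idea concrete, here is a minimal sketch of how highlight-reel membership can be turned into binary importance labels. This is an illustration, not the authors' released pipeline: the `SubEvent` structure, the 0.5 coverage threshold, and the assumption that highlight clips are already aligned to the full-game timeline are all hypothetical choices.

```python
# Sketch: derive importance labels from a highlight reel (assumptions noted above).
from dataclasses import dataclass

@dataclass
class SubEvent:
    start: float  # seconds from kickoff on the full-game timeline
    end: float

def overlap(a: SubEvent, b: SubEvent) -> float:
    """Length in seconds of the temporal intersection of two intervals."""
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))

def label_importance(sub_events, highlight_clips, min_coverage=0.5):
    """Mark a sub-event 'important' if highlight clips cover at least
    `min_coverage` of its duration. Assumes clips are aligned to the
    full-game timeline and do not overlap one another."""
    labels = []
    for ev in sub_events:
        covered = sum(overlap(ev, clip) for clip in highlight_clips)
        labels.append(covered / (ev.end - ev.start) >= min_coverage)
    return labels

# Example: two goals appear in the reel, a throw-in does not.
game = [SubEvent(120, 150), SubEvent(900, 930), SubEvent(2000, 2010)]
reel = [SubEvent(118, 152), SubEvent(898, 935)]
print(label_importance(game, reel))  # [True, True, False]
```

Because the highlight editors have already made the importance judgment, this kind of interval matching yields preference labels at no extra annotation cost; the main practical difficulty is the alignment step, which the sketch takes as given.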