🤖 AI Summary
This work addresses short-video user engagement prediction. We propose a joint multimodal modeling approach that leverages large multimodal models (LMMs), specifically VideoLLaMA2 and Qwen2.5-VL, to integrate keyframe visual features, textual metadata, and audio representations for cross-modal semantic understanding. To our knowledge, this is the first empirical validation of LMMs for engagement prediction, and it reveals that the audio modality provides a critical performance boost. Furthermore, a multi-model ensemble substantially improves both robustness and accuracy. The method is trained and optimized on the SnapUGC dataset and ranked first in the ICCV VQualA 2025 EVQA-SnapUGC Challenge, outperforming all competing approaches. These results demonstrate the effectiveness and practical viability of LMMs for user engagement prediction on short-video platforms.
📝 Abstract
The rapid proliferation of user-generated content (UGC) on short-form video platforms has made video engagement prediction increasingly important for optimizing recommendation systems and guiding content creation. However, the task remains challenging due to the complex interplay of factors such as semantic content, visual quality, audio characteristics, and user background. Prior studies have leveraged features from multiple modalities, including visual quality, semantic content, and background sound, but they often struggle to effectively model cross-feature and cross-modality interactions. In this work, we empirically investigate the potential of large multimodal models (LMMs) for video engagement prediction. We adopt two representative LMMs: VideoLLaMA2, which integrates audio, visual, and language modalities, and Qwen2.5-VL, which models only the visual and language modalities. Specifically, VideoLLaMA2 jointly processes key video frames, text-based metadata, and background sound, while Qwen2.5-VL uses only key video frames and text-based metadata. Trained on the SnapUGC dataset, both models achieve competitive performance against state-of-the-art baselines, demonstrating the effectiveness of LMMs for engagement prediction. Notably, VideoLLaMA2 consistently outperforms Qwen2.5-VL, highlighting the importance of audio features in engagement prediction. By ensembling the two types of models, our method achieves first place in the ICCV VQualA 2025 EVQA-SnapUGC Challenge on short-form video engagement prediction. The code is available at https://github.com/sunwei925/LMM-EVQA.git.
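The abstract attributes the final result to ensembling the two model families. As a minimal sketch of how per-video predictions from two models could be fused, assuming a simple weighted average (the function name, the weighting scheme, and the example scores below are illustrative assumptions, not the paper's exact ensembling method):

```python
def ensemble_scores(scores_a, scores_b, weight_a=0.5):
    """Weighted average of two models' per-video engagement scores.

    scores_a / scores_b: aligned lists of predicted engagement values
    (one entry per video); weight_a controls the first model's share.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must be aligned per video")
    weight_b = 1.0 - weight_a
    return [weight_a * a + weight_b * b for a, b in zip(scores_a, scores_b)]


# Hypothetical per-video engagement predictions from the two models
# (values are made up for illustration).
videollama2_scores = [0.72, 0.31, 0.55]
qwen25vl_scores = [0.68, 0.35, 0.49]

# Slightly favor the audio-aware model, since the paper reports
# VideoLLaMA2 consistently outperforming Qwen2.5-VL.
fused = ensemble_scores(videollama2_scores, qwen25vl_scores, weight_a=0.6)
```

In practice the fusion weight would be tuned on a validation split; an equal-weight average (`weight_a=0.5`) is the usual default when no validation data is available.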