Engagement Prediction of Short Videos with Large Multimodal Models

📅 2025-08-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the problem of short-video user engagement prediction. We propose a joint multimodal modeling approach leveraging large multimodal models—specifically VideoLLaMA2 and Qwen2.5-VL—to integrate keyframe visual features, textual metadata, and audio representations for cross-modal semantic understanding. To our knowledge, this is the first empirical validation of large multimodal models (LMMs) for engagement prediction, revealing that the audio modality provides a critical performance boost. Furthermore, a multi-model ensemble substantially improves both robustness and accuracy. The method is trained and optimized on the SnapUGC dataset and ranked first in the ICCV VQualA 2025 EVQA-SnapUGC Challenge, outperforming all existing approaches. These results demonstrate the superiority and practical viability of LMMs for user engagement prediction on short-video platforms.

📝 Abstract
The rapid proliferation of user-generated content (UGC) on short-form video platforms has made video engagement prediction increasingly important for optimizing recommendation systems and guiding content creation. However, this task remains challenging due to the complex interplay of factors such as semantic content, visual quality, audio characteristics, and user background. Prior studies have leveraged various types of features from different modalities, such as visual quality, semantic content, background sound, etc., but often struggle to effectively model their cross-feature and cross-modality interactions. In this work, we empirically investigate the potential of large multimodal models (LMMs) for video engagement prediction. We adopt two representative LMMs: VideoLLaMA2, which integrates audio, visual, and language modalities, and Qwen2.5-VL, which models only visual and language modalities. Specifically, VideoLLaMA2 jointly processes key video frames, text-based metadata, and background sound, while Qwen2.5-VL utilizes only key video frames and text-based metadata. Trained on the SnapUGC dataset, both models demonstrate competitive performance against state-of-the-art baselines, showcasing the effectiveness of LMMs in engagement prediction. Notably, VideoLLaMA2 consistently outperforms Qwen2.5-VL, highlighting the importance of audio features in engagement prediction. By ensembling two types of models, our method achieves first place in the ICCV VQualA 2025 EVQA-SnapUGC Challenge on short-form video engagement prediction. The code is available at https://github.com/sunwei925/LMM-EVQA.git.
Problem

Research questions and friction points this paper is trying to address.

Predicting engagement for short videos using multimodal models
Modeling cross-feature and cross-modality interactions effectively
Evaluating audio's role in video engagement prediction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses large multimodal models for engagement prediction
Integrates audio, visual, and language modalities
Ensembles models to enhance prediction accuracy
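The ensembling idea above can be sketched in a few lines: fuse per-video engagement scores from the two model families (e.g., a VideoLLaMA2-based and a Qwen2.5-VL-based predictor) via a weighted average. This is an illustrative sketch only; the function name, weights, and fusion rule are assumptions, not details taken from the paper.

```python
def ensemble_engagement(scores_a, scores_b, weight_a=0.5):
    """Weighted average of per-video engagement predictions.

    scores_a, scores_b: lists of engagement scores from two models,
    aligned by video. weight_a controls the first model's contribution.
    (Illustrative fusion rule; the paper's exact scheme may differ.)
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("prediction lists must be the same length")
    w_b = 1.0 - weight_a
    return [weight_a * a + w_b * b for a, b in zip(scores_a, scores_b)]

# Example: fuse predictions from two models for three videos,
# weighting the (hypothetical) audio-aware model more heavily.
fused = ensemble_engagement([0.8, 0.4, 0.6], [0.6, 0.5, 0.7], weight_a=0.6)
```

A fixed weighted average is the simplest fusion strategy; learned weights or rank-based fusion are common alternatives when a validation set is available.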
Wei Sun
East China Normal University
Linhan Cao
Shanghai Jiao Tong University
Image Quality Assessment, Video Quality Assessment
Yuqin Cao
Shanghai Jiao Tong University
Weixia Zhang
East China Normal University
Wen Wen
City University of Hong Kong
Kaiwei Zhang
Shanghai Jiao Tong University
Zijian Chen
Shanghai Jiao Tong University | Shanghai AI Laboratory
Image/Video Quality Assessment, Large Multi-modal Models
Fangfang Lu
Shanghai University of Electric Power
Xiongkuo Min
Shanghai Jiao Tong University
Guangtao Zhai
Professor, IEEE Fellow, Shanghai Jiao Tong University
Multimedia Signal Processing, Visual Quality Assessment, QoE, AI Evaluation, Displays