eMotions: A Large-Scale Dataset and Audio-Visual Fusion Network for Emotion Analysis in Short-form Videos

📅 2025-08-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Short-form video (SV) sentiment analysis faces challenges including data scarcity, large modality semantic gaps, and local biases induced by audio-visual co-expression. To address these, we introduce eMotions, the first large-scale Chinese SV sentiment annotation dataset, and propose AV-CANet, an end-to-end audio-visual fusion network. AV-CANet incorporates a Local-Global Fusion Module to progressively model cross-modal correlations, employs an EP-CE Loss with tripolar penalties to sharpen sentiment decision boundaries, and leverages a video Transformer with attention mechanisms for robust representation learning. A multi-stage annotation strategy significantly mitigates subjective bias. Experiments demonstrate that AV-CANet achieves state-of-the-art performance on eMotions and four public benchmarks, and ablation studies validate the efficacy of each component. This work provides both a high-quality, community-accessible dataset and a reproducible, strong baseline for SV sentiment analysis.

📝 Abstract
Short-form videos (SVs) have become a vital part of our online routine for acquiring and sharing information. Their multimodal complexity poses new challenges for video analysis, highlighting the need for video emotion analysis (VEA) within the community. Given the limited availability of SV emotion data, we introduce eMotions, a large-scale dataset consisting of 27,996 videos with full-scale annotations. To ensure quality and reduce subjective bias, we emphasize better personnel allocation and propose a multi-stage annotation procedure. Additionally, we provide category-balanced and test-oriented variants through targeted sampling to meet diverse needs. While there have been significant studies on videos with clear emotional cues (e.g., facial expressions), analyzing emotions in SVs remains challenging. The challenge arises from their broader content diversity, which introduces more distinct semantic gaps and complicates the learning of emotion-related representations. Furthermore, the prevalence of audio-visual co-expression in SVs leads to local biases and collective information gaps caused by inconsistencies in emotional expression. To tackle this, we propose AV-CANet, an end-to-end audio-visual fusion network that leverages a video Transformer to capture semantically relevant representations. We further introduce a Local-Global Fusion Module designed to progressively capture the correlations of audio-visual features. In addition, an EP-CE Loss is constructed to globally steer optimization with tripolar penalties. Extensive experiments across three eMotions-related datasets and four public VEA datasets demonstrate the effectiveness of our proposed AV-CANet, while providing broad insights for future research. Moreover, we conduct ablation studies to examine the critical components of our method. The dataset and code will be made available on GitHub.
Problem

Research questions and friction points this paper is trying to address.

Lack of large-scale emotion datasets for short-form videos
Challenges in analyzing emotions due to diverse content and semantic gaps
Inconsistencies in audio-visual emotional expressions causing local biases
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale dataset eMotions with multi-stage annotation
AV-CANet: audio-visual fusion with video transformer
Local-Global Fusion Module for feature correlations
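
The paper's EP-CE Loss is described only as steering optimization "with tripolar penalties" across the negative/neutral/positive sentiment poles. As a rough intuition, not the paper's actual formulation, one could imagine a cross-entropy term augmented with a penalty that grows when the predicted class falls in a different sentiment polarity than the label. The class-to-polarity mapping, `alpha` weight, and penalty scheme below are all illustrative assumptions:

```python
import math

# Hypothetical class-to-polarity map: 0 = negative, 1 = neutral, 2 = positive.
POLARITY = {0: -1, 1: 0, 2: +1}

def tripolar_ce(logits, label, alpha=0.5):
    """Cross-entropy plus an extra penalty when the argmax prediction
    lies in a different sentiment polarity than the true label.
    This is a sketch of the general idea, not the paper's EP-CE Loss."""
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    ce = -math.log(probs[label])
    pred = probs.index(max(probs))
    # Penalty scales with polarity distance: 0 (same pole), 1, or 2 poles apart.
    gap = abs(POLARITY[pred] - POLARITY[label])
    return ce + alpha * gap

# Predicting the opposite pole costs more than a correct-pole prediction:
assert tripolar_ce([0.1, 0.1, 2.0], 0) > tripolar_ce([2.0, 0.1, 0.1], 0)
```

The point of such a penalty is to shape the decision boundary globally: confusing two same-polarity classes is cheaper than flipping sentiment polarity outright.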
Xuecheng Wu
School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, 710049, China
Dingkang Yang
ByteDance
Multimodal Learning · Generative AI · Embodied AI
Danlei Huang
School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, 710049, China
Xinyi Yin
School of Cyber Science and Engineering, Zhengzhou University, Zhengzhou, 450002, China
Yifan Wang
Institute of Advanced Technology, University of Science and Technology of China, Hefei, 230031, China
Jia Zhang
School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, 710049, China
Jiayu Nie
Inspur Electronic Information Industry Co., Ltd, Jinan, 250101, China
Liangyu Fu
School of Software, Northwestern Polytechnical University, Xi’an, 710072, China
Yang Liu
Department of Computer Science, The University of Toronto, Toronto, ON M5S 1A1, Canada
Junxiao Xue
Zhejiang Lab
Computer Graphics · Crowd Simulation · Multi-agent Modeling · Multi-modal Learning
Hadi Amirpour
University of Klagenfurt
Video Compression · Quality of Experience · Video Streaming · Medical Image Processing
Wei Zhou
School of Computer Science and Informatics, Cardiff University, Cardiff, CF24 4AG, United Kingdom