🤖 AI Summary
Short-form video (SV) sentiment analysis faces challenges including data scarcity, large cross-modal semantic gaps, and local biases induced by audio-visual co-expression. To address these, the authors introduce eMotions, a large-scale SV emotion dataset, and propose AV-CANet, an end-to-end audio-visual fusion network. AV-CANet leverages a video Transformer to capture semantically relevant representations, incorporates a Local-Global Fusion Module to progressively model cross-modal correlations, and employs an EP-CE Loss with tripolar penalties to globally steer optimization. A multi-stage annotation procedure mitigates subjective bias. Experiments on three eMotions-related datasets and four public VEA benchmarks demonstrate the effectiveness of AV-CANet, and ablation studies validate its critical components. This work provides both a high-quality, community-accessible dataset and a reproducible, strong baseline for SV sentiment analysis.
📝 Abstract
Short-form videos (SVs) have become a vital part of our online routine for acquiring and sharing information. Their multimodal complexity poses new challenges for video analysis, highlighting the need for video emotion analysis (VEA) within the community. Given the limited availability of SV emotion data, we introduce eMotions, a large-scale dataset consisting of 27,996 videos with full-scale annotations. To ensure quality and reduce subjective bias, we emphasize better personnel allocation and propose a multi-stage annotation procedure. Additionally, we provide category-balanced and test-oriented variants through targeted sampling to meet diverse needs. While there have been significant studies on videos with clear emotional cues (e.g., facial expressions), analyzing emotions in SVs remains challenging. The difficulty arises from their broader content diversity, which introduces more distinct semantic gaps and complicates the learning of emotion-related representations. Furthermore, the prevalence of audio-visual co-expression in SVs leads to local biases and collective information gaps caused by inconsistencies in emotional expression. To tackle this, we propose AV-CANet, an end-to-end audio-visual fusion network that leverages a video Transformer to capture semantically relevant representations. We further introduce the Local-Global Fusion Module, designed to progressively capture the correlations of audio-visual features. In addition, the EP-CE Loss is constructed to globally steer optimization with tripolar penalties. Extensive experiments across three eMotions-related datasets and four public VEA datasets demonstrate the effectiveness of our proposed AV-CANet, while providing broad insights for future research. Moreover, we conduct ablation studies to examine the critical components of our method. The dataset and code will be made available on GitHub.
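The abstract does not give the EP-CE formula, but the idea of steering a cross-entropy objective with tripolar penalties can be illustrated with a minimal sketch. The sketch below is a hypothetical construction, not the paper's actual loss: it assumes each class carries a polarity label in {-1, 0, +1} (negative / neutral / positive) and adds a penalty on probability mass assigned to the pole opposite the target's; the function name `ep_ce_sketch` and the weight `lam` are illustrative assumptions.

```python
import math

def ep_ce_sketch(probs, target, polarities, lam=1.0):
    """Hypothetical tripolar-penalized cross-entropy for one sample.

    probs      -- predicted class probabilities (already softmax-normalized)
    target     -- index of the ground-truth class
    polarities -- per-class polarity in {-1, 0, +1}
    lam        -- assumed hyperparameter weighting the tripolar penalty
    """
    # Standard cross-entropy term: -log p(target).
    ce = -math.log(probs[target])
    # Penalty: total probability mass placed on classes whose polarity is
    # opposite to the target's (no penalty when the target is neutral).
    opposite = sum(
        p for p, pol in zip(probs, polarities)
        if pol != 0 and pol == -polarities[target]
    )
    return ce + lam * opposite
```

Under this reading, confusing a positive-polarity sample with a negative-polarity class is penalized more heavily than confusing it with a neutral one, which is one plausible way a tripolar term could sharpen decision boundaries between sentiment poles.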