Omni-SILA: Towards Omni-scene Driven Visual Sentiment Identifying, Locating and Attributing in Videos

πŸ“… 2025-02-26
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses video-based visual emotion understanding by proposing Omni-SILAβ€”the first unified task jointly modeling explicit cues (e.g., facial expressions) and implicit scene cues (e.g., actions, object relations, background) for integrated emotion recognition, spatiotemporal localization, and attribution explanation. Methodologically, we introduce the Implicit-enhanced Causal Mixture-of-Experts (ICM) architecture, comprising a Scene-Balanced MoE (SBM) and an Implicit-Enhanced Causal (IEC) module, which alleviates reliance on explicit cues via implicit-aware representation learning, multimodal MoE routing, causal inference, and video-temporal modeling. We construct the dual-track Omni-SILA dataset with fine-grained explicit/implicit annotations. Experiments demonstrate that our approach outperforms state-of-the-art Video-LLMs by 12.7% in emotion attribution accuracy and 9.3% in localization mAP.

Technology Category

Application Category

πŸ“ Abstract
Prior studies on Visual Sentiment Understanding (VSU) primarily rely on the explicit scene information (e.g., facial expression) to judge visual sentiments, which largely ignore implicit scene information (e.g., human action, objection relation and visual background), while such information is critical for precisely discovering visual sentiments. Motivated by this, this paper proposes a new Omni-scene driven visual Sentiment Identifying, Locating and Attributing in videos (Omni-SILA) task, aiming to interactively and precisely identify, locate and attribute visual sentiments through both explicit and implicit scene information. Furthermore, this paper believes that this Omni-SILA task faces two key challenges: modeling scene and highlighting implicit scene beyond explicit. To this end, this paper proposes an Implicit-enhanced Causal MoE (ICM) approach for addressing the Omni-SILA task. Specifically, a Scene-Balanced MoE (SBM) and an Implicit-Enhanced Causal (IEC) blocks are tailored to model scene information and highlight the implicit scene information beyond explicit, respectively. Extensive experimental results on our constructed explicit and implicit Omni-SILA datasets demonstrate the great advantage of the proposed ICM approach over advanced Video-LLMs.
Problem

Research questions and friction points this paper is trying to address.

Identifies visual sentiments using explicit and implicit scene information
Locates and attributes sentiments in videos beyond explicit cues
Models and highlights implicit scene information for precise sentiment analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Omni-scene driven sentiment analysis in videos
Implicit-enhanced Causal MoE for scene modeling
Scene-Balanced MoE and Implicit-Enhanced Causal blocks
J
Jiamin Luo
School of Computer Science and Technology, Soochow University, Suzhou, China
Jingjing Wang
Jingjing Wang
Professor, School of Cyber Science and Technology, Beihang University
AI for WirelessUAV NetworksSpace-Air-Ground-Sea NetworksCommunication Security
J
Junxiao Ma
School of Computer Science and Technology, Soochow University, Suzhou, China
Y
Yujie Jin
School of Computer Science and Technology, Soochow University, Suzhou, China
Shoushan Li
Shoushan Li
Soochow University
Natural Language ProcessingSentiment AnalysisMachine Learning
Guodong Zhou
Guodong Zhou
Soochow University, China
Natural Language ProcessingArtificial Intelligence