Face-Guided Sentiment Boundary Enhancement for Weakly-Supervised Temporal Sentiment Localization

📅 2026-03-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses imprecise boundary localization of emotional segments in videos under point-level weak supervision. It proposes FSENet, a framework that, for the first time, introduces fine-grained facial features to guide multimodal emotion localization. FSENet integrates facial cues with multimodal contextual information through three key components: face-guided sentiment discovery, point-aware sentiment semantics contrast, and boundary-aware sentiment pseudo-label generation. Extensive experiments show that FSENet achieves state-of-the-art performance across various weakly supervised settings, significantly improving the accuracy, generalization, and robustness of emotional boundary detection.
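The last component, boundary-aware pseudo-label generation, turns sparse single-frame annotations into dense supervision. A minimal sketch of that general idea, assuming a simple per-point Gaussian expansion (the function name, `sigma` parameter, and max-combination rule are illustrative choices, not FSENet's actual BSPG module):

```python
import math

def expand_point_labels(num_frames, points, sigma=2.0):
    """Expand sparse point annotations into a smooth [0, 1] pseudo-label track.

    Illustrative sketch only: place a Gaussian bump around each annotated
    frame index and take the per-frame maximum over all bumps, so nearby
    frames receive soft supervisory weight instead of a single spike.
    """
    labels = [0.0] * num_frames
    for t in range(num_frames):
        for p in points:
            bump = math.exp(-((t - p) ** 2) / (2 * sigma ** 2))
            labels[t] = max(labels[t], bump)
    return labels

# Two annotated frames (5 and 14) in a 20-frame clip yield a track that
# peaks at 1.0 on the annotated frames and decays smoothly around them.
pseudo = expand_point_labels(20, points=[5, 14])
```

Taking the maximum rather than the sum keeps each label bounded in [0, 1] even when two annotated points sit close together.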

📝 Abstract
Point-level weakly-supervised temporal sentiment localization (P-WTSL) aims to detect sentiment-relevant segments in untrimmed multimodal videos using single-timestamp sentiment annotations, which greatly reduces costly frame-level labeling. To tackle the challenge of imprecise sentiment boundaries in P-WTSL, we propose the Face-guided Sentiment Boundary Enhancement Network (FSENet), a unified framework that leverages fine-grained facial features to guide sentiment localization. Our approach first introduces the Face-guided Sentiment Discovery (FSD) module, which integrates facial features into multimodal interaction via dual-branch modeling to extract effective sentiment stimulus cues. We then propose the Point-aware Sentiment Semantics Contrast (PSSC) strategy, which discriminates the sentiment semantics of frame-level candidate points near annotation points via contrastive learning, thereby strengthening the model's ability to recognize sentiment boundaries. Finally, we design the Boundary-aware Sentiment Pseudo-label Generation (BSPG) approach to convert sparse point annotations into temporally smooth supervisory pseudo-labels. Extensive experiments and visualizations on the benchmark demonstrate the effectiveness of our framework: FSENet achieves state-of-the-art performance under full supervision, video-level, and point-level weak supervision, showcasing its strong generalization across different annotation settings.
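The point-aware contrast described above pulls frames that share the annotated sentiment toward each other and pushes background frames away. A generic InfoNCE-style sketch of that idea (the function names, cosine-similarity choice, and temperature `tau` are assumptions for illustration; the paper's PSSC objective may differ):

```python
import math

def _cos(u, v):
    # Cosine similarity between two feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def point_contrast_loss(anchor, positives, negatives, tau=0.1):
    """InfoNCE-style contrastive loss over frame features.

    Hypothetical sketch: the anchor is an annotated frame's feature;
    positives are candidate frames assumed to share its sentiment,
    negatives are background frames. Lower loss means positives are
    already closer to the anchor than negatives are.
    """
    pos = sum(math.exp(_cos(anchor, p) / tau) for p in positives)
    neg = sum(math.exp(_cos(anchor, n) / tau) for n in negatives)
    return -math.log(pos / (pos + neg))

# When the positive is nearly aligned with the anchor and the negative is
# orthogonal, the loss is small; swapping the two roles makes it large.
loss_aligned = point_contrast_loss([1.0, 0.0], [[0.9, 0.1]], [[0.0, 1.0]])
loss_misaligned = point_contrast_loss([1.0, 0.0], [[0.0, 1.0]], [[0.9, 0.1]])
```

Minimizing such a loss sharpens the separation between sentiment-bearing and background frames around each annotation point, which is what enables finer boundary decisions.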
Problem

Research questions and friction points this paper is trying to address.

weakly-supervised learning
temporal sentiment localization
sentiment boundary
multimodal video
point-level annotation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Face-guided Sentiment Discovery
Point-aware Sentiment Semantics Contrast
Boundary-aware Sentiment Pseudo-label Generation
Weakly-Supervised Temporal Localization
Multimodal Sentiment Analysis
Cailing Han
Hefei University of Technology
Zhangbin Li
Hefei University of Technology
Jinxing Zhou
MBZUAI
Wei Qian
Hefei University of Technology
Jingjing Hu
Hefei University of Technology
Yanghao Zhou
NUS
Zhangling Duan
Institute of Artificial Intelligence, Hefei Comprehensive National Science Center
Dan Guo
IEEE senior member, Professor, Hefei University of Technology
Multimedia Computing · Artificial Intelligence