🤖 AI Summary
This work addresses the challenge of imprecise boundary localization of emotional segments in videos under point-level weak supervision. To this end, we propose FSENet, a novel framework that, to our knowledge, is the first to introduce fine-grained facial features to guide multimodal emotion localization. The method integrates facial cues with multimodal contextual information through three key components: facial-guided emotion discovery, point-aware semantic contrastive learning, and boundary-aware pseudo-label generation. Extensive experiments demonstrate that FSENet achieves state-of-the-art performance across various weakly supervised settings, significantly improving the accuracy, generalization, and robustness of emotional boundary detection.
📝 Abstract
Point-level weakly-supervised temporal sentiment localization (P-WTSL) aims to detect sentiment-relevant segments in untrimmed multimodal videos using only single-timestamp sentiment annotations, greatly reducing costly frame-level labeling. To tackle the challenge of imprecise sentiment boundaries in P-WTSL, we propose the Face-guided Sentiment Boundary Enhancement Network (\textbf{FSENet}), a unified framework that leverages fine-grained facial features to guide sentiment localization. Specifically, our approach \textit{first} introduces the Face-guided Sentiment Discovery (FSD) module, which integrates facial features into multimodal interaction via dual-branch modeling to effectively capture sentiment-stimulus cues. We \textit{then} propose the Point-aware Sentiment Semantics Contrast (PSSC) strategy, which discriminates the sentiment semantics of frame-level candidate points near annotated points via contrastive learning, thereby enhancing the model's ability to recognize sentiment boundaries. \textit{Finally}, we design the Boundary-aware Sentiment Pseudo-label Generation (BSPG) approach, which converts sparse point annotations into temporally smooth supervisory pseudo-labels. Extensive experiments and visualizations on the benchmark demonstrate the effectiveness of our framework: FSENet achieves state-of-the-art performance under full supervision, video-level, and point-level weak supervision, showcasing its strong generalization across different annotation settings.
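To make the BSPG idea concrete, the sketch below shows one common way to turn sparse point annotations into temporally smooth pseudo-labels: each annotated frame seeds a Gaussian bump over time, and overlapping bumps are merged. This is an illustrative sketch under assumed choices (Gaussian smoothing, element-wise max merging, a `sigma` width parameter); the paper's actual BSPG formulation may differ.

```python
import numpy as np

def point_to_pseudo_labels(num_frames, points, sigma=4.0):
    """Convert sparse point annotations into smooth frame-level pseudo-labels.

    Each annotated timestamp contributes a Gaussian bump centered on it;
    overlapping bumps are merged with an element-wise max so every label
    stays in [0, 1]. Illustrative only -- not the paper's exact BSPG.
    """
    t = np.arange(num_frames, dtype=np.float32)
    labels = np.zeros(num_frames, dtype=np.float32)
    for p in points:
        labels = np.maximum(labels, np.exp(-((t - p) ** 2) / (2 * sigma ** 2)))
    return labels

# Example: two annotated points in a 20-frame clip.
pseudo = point_to_pseudo_labels(20, points=[5, 14], sigma=2.0)
```

The resulting dense curve peaks at 1 on each annotated frame and decays smoothly toward the segment boundaries, giving the model a frame-level supervisory signal in place of two isolated labels.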