AI Summary
This work addresses the challenge of accurately localizing and explaining affective moments in long videos from vague natural-language queries, a task where existing methods fall short. We introduce the Vague-Query-driven video Affective Understanding (VQAU) task, along with VQAU-Bench, a unified benchmark covering temporal localization, emotion labeling, and evidence-grounded explanations. To tackle this problem, we propose AffectSeek, a multi-agent collaborative framework that performs interpretable affective understanding through intent parsing, candidate moment localization, cross-stage verification, and reasoning. Experimental results show that current vision-language models perform poorly on this task, whereas AffectSeek improves affective moment localization, classification, and explanation, offering an effective approach to emotion-aware interaction with long-form video content.
Abstract
Existing affective understanding studies have mainly focused on recognizing emotions from images, audio signals, or pre-clipped video segments, where the affective evidence is already given. This passive and clip-centered setting does not fully reflect real-world scenarios, in which users often interact with long videos and express their needs through natural-language queries. In this paper, we study \textbf{Vague-Query-driven video Affective Understanding (VQAU)}, a new task that requires models to localize affective moments in long videos, predict their emotion categories, and generate evidence-grounded rationales under vague user queries. To support this task, we construct \textbf{VQAU-Bench}, a benchmark that integrates long videos, vague affective queries, temporal clip annotations, emotion labels, and rationale explanations into a unified evaluation framework. VQAU-Bench enables systematic assessment of semantic-temporal-affective alignment, affective moment localization, emotion classification, and rationale generation. To address the multi-step reasoning challenges of VQAU, we further propose \textbf{AffectSeek}, an agentic framework that actively seeks, verifies, and explains affective moments in long videos. AffectSeek decomposes VQAU into intent interpretation, candidate localization, clip verification, emotion reasoning, and rationale generation, and progressively aligns vague user intent with long-video evidence through role-specialized reasoning and cross-stage verification. Experiments show that VQAU remains challenging for existing affective recognition models and single-step vision-language models, while AffectSeek provides a simple yet effective framework for agentic long-video affective understanding.
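To make the task format concrete, the following is a minimal sketch of what a single VQAU-style record and a localization score could look like. The abstract names the annotated components (long video, vague query, temporal clip, emotion label, rationale) but does not specify a schema or metric, so the `VQAUSample` field names and the use of temporal IoU here are assumptions for illustration, not the benchmark's actual definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VQAUSample:
    """One hypothetical benchmark record; field names are illustrative only."""
    video_id: str
    query: str       # vague natural-language query from the user
    start_s: float   # annotated affective moment start, in seconds
    end_s: float     # annotated affective moment end, in seconds
    emotion: str     # emotion category label
    rationale: str   # evidence-grounded explanation


def temporal_iou(pred: tuple[float, float], gold: tuple[float, float]) -> float:
    """Intersection-over-union of two (start, end) intervals in seconds.

    Assumed here as a localization score; the source does not name its metric.
    """
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0
```

Under these assumptions, a predicted moment would be compared with the annotation via `temporal_iou((pred_start, pred_end), (sample.start_s, sample.end_s))`, with emotion classification and rationale quality scored separately.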