AffectSeek: Agentic Affective Understanding in Long Videos under Vague User Queries

📅 2026-05-06
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the challenge of accurately localizing and explaining affective moments in long videos based on ambiguous natural-language queries, a task where existing methods fall short. We introduce the Vague-Query-driven video Affective Understanding (VQAU) task, along with VQAU-Bench, a unified benchmark covering temporal localization, emotion labeling, and evidence-grounded explanations. To tackle this problem, we propose AffectSeek, a multi-agent collaborative framework that supports interpretable affective understanding through intent parsing, candidate moment localization, cross-stage verification, and reasoning. Experimental results show that current vision-language models perform poorly on this task, whereas AffectSeek improves affective moment localization, classification, and explanation, offering an effective approach to emotion-aware interaction with long-form video content.
📝 Abstract
Existing affective understanding studies have mainly focused on recognizing emotions from images, audio signals, or pre-clipped video clips, where the affective evidence is already given. This passive and clip-centered setting does not fully reflect real-world scenarios, in which users often interact with long videos and express their needs through natural-language queries. In this paper, we study Vague-Query-driven video Affective Understanding (VQAU), a new task that requires models to localize affective moments in long videos, predict their emotion categories, and generate evidence-grounded rationales under vague user queries. To support this task, we construct VQAU-Bench, a benchmark that integrates long videos, vague affective queries, temporal clip annotations, emotion labels, and rationale explanations into a unified evaluation framework. VQAU-Bench enables systematic assessment of semantic-temporal-affective alignment, affective moment localization, emotion classification, and rationale generation. To address the multi-step reasoning challenges of VQAU, we further propose AffectSeek, an agentic framework that actively seeks, verifies, and explains affective moments in long videos. AffectSeek decomposes VQAU into intent interpretation, candidate localization, clip verification, emotion reasoning, and rationale generation, and progressively aligns vague user intent with long-video evidence through role-specialized reasoning and cross-stage verification. Experiments show that VQAU remains challenging for existing affective recognition models and single-step vision-language models, while AffectSeek provides a simple yet effective framework for agentic long-video affective understanding.
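The five stages named in the abstract (intent interpretation, candidate localization, clip verification, emotion reasoning, rationale generation) can be pictured as a staged pipeline. The sketch below is only an illustration of that control flow, not the authors' implementation: every name in it (Clip, Result, run_pipeline, demo, and the stage callables) is hypothetical, and the toy stages in demo are placeholders standing in for what would presumably be model calls.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Clip:
    start: float  # seconds from the beginning of the video
    end: float


@dataclass
class Result:
    clip: Clip
    emotion: str
    rationale: str


def run_pipeline(
    query: str,
    video_duration: float,
    interpret: Callable[[str], str],
    localize: Callable[[str, float], List[Clip]],
    verify: Callable[[str, Clip], bool],
    reason: Callable[[str, Clip], str],
    explain: Callable[[str, Clip, str], str],
) -> List[Result]:
    """Chain the five stages; all stage callables are hypothetical."""
    intent = interpret(query)                     # intent interpretation
    candidates = localize(intent, video_duration)  # candidate localization
    results = []
    for clip in candidates:
        if not verify(intent, clip):              # clip verification
            continue
        emotion = reason(intent, clip)            # emotion reasoning
        rationale = explain(intent, clip, emotion)  # rationale generation
        results.append(Result(clip, emotion, rationale))
    return results


def demo() -> List[Result]:
    """Run the pipeline with toy placeholder stages (assumptions only)."""
    return run_pipeline(
        query="  When does she seem upset?  ",
        video_duration=30.0,
        interpret=lambda q: q.strip().lower(),
        # Placeholder: propose fixed 10-second windows as candidates.
        localize=lambda intent, d: [
            Clip(float(t), min(float(t) + 10.0, d)) for t in range(0, int(d), 10)
        ],
        # Placeholder: reject the first window, accept the rest.
        verify=lambda intent, clip: clip.start >= 10.0,
        reason=lambda intent, clip: "neutral",
        explain=lambda intent, clip, emotion: (
            f"{emotion} in [{clip.start}, {clip.end}] for intent '{intent}'"
        ),
    )


if __name__ == "__main__":
    for r in demo():
        print(r)
```

Passing the stages in as callables keeps the control flow separate from whatever models back each role, so a single stage can be swapped or tested in isolation.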
Problem

Research questions and friction points this paper is trying to address.

Vague-Query-driven
Affective Understanding
Long Videos
Emotion Localization
Rationale Generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vague-Query-driven Affective Understanding
Long Video Emotion Recognition
Agentic Reasoning
AffectSeek
VQAU-Bench
Zhen Zhang
Gansu Provincial Key Laboratory of Wearable Computing, School of Information Science and Engineering, Lanzhou University, Gansu 730000, China, and Guangdong-Hong Kong-Macao Joint Laboratory for Emotional Intelligence and Pervasive Computing, Shenzhen MSU-BIT University, Shenzhen 518107, China
Yuhang Yang
Artificial Intelligence Research Institute, Shenzhen MSU-BIT University and Guangdong-Hong Kong-Macao Joint Laboratory for Emotional Intelligence and Pervasive Computing, Shenzhen MSU-BIT University, Shenzhen 518107, China
Yunxiang Jiang
Artificial Intelligence Research Institute, Shenzhen MSU-BIT University and Guangdong-Hong Kong-Macao Joint Laboratory for Emotional Intelligence and Pervasive Computing, Shenzhen MSU-BIT University, Shenzhen 518107, China
Yuhuan Lu
IOTSC, University of Macau
Knowledge Representation, Large Language Models, Intelligent Transportation Systems
Haifeng Lu
Artificial Intelligence Research Institute, Shenzhen MSU-BIT University, Shenzhen 518107, China, and the Department of Electrical and Computer Engineering, The University of Hong Kong, China
Zheng Lian
Associate Professor, IEEE/CCF Senior Member, Institute of Automation, Chinese Academy of Sciences
Affective Computing, Sentiment Analysis, Machine Learning
Runhao Zeng
Artificial Intelligence Research Institute, Shenzhen MSU-BIT University and Guangdong-Hong Kong-Macao Joint Laboratory for Emotional Intelligence and Pervasive Computing, Shenzhen MSU-BIT University, Shenzhen 518107, China
Xiping Hu
Professor at Beijing Institute of Technology
Cyber-Physical System, Crowd Computing, Affective Computing