🤖 AI Summary
Existing multimodal large language models (MLLMs) prioritize semantic video understanding while neglecting how audiences emotionally respond, leaving them with limited ability to predict and explain viewers' emotions. To address this, we propose the first audience-centric MLLM framework for video affective reasoning. The method introduces a two-level stimulus-aware mechanism, combining event-driven frame-level sampling with token-level spatiotemporal tube selection, together with a dedicated Video Affective Reasoning (VAR) instruction-tuning dataset for emotion-guided fine-tuning, and is assessed with a comprehensive, multi-dimensional interpretability evaluation protocol. Experiments demonstrate state-of-the-art performance across affective reasoning benchmarks, improving the accuracy of emotional response prediction and the coherence, insightfulness, and interpretability of affective attribution analysis.
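To make the event-driven frame-level sampling idea concrete, here is a minimal sketch that is not the paper's implementation: it assumes per-frame embeddings from a frozen vision encoder and approximates "events" as the largest inter-frame feature changes; the function name, scoring rule, and shapes are all illustrative assumptions.

```python
# Hypothetical sketch of event-driven frame sampling. The paper samples frames
# around likely emotion-evoking events; here an "event" is crudely scored as a
# large change between consecutive frame embeddings.
import numpy as np

def sample_event_frames(frame_feats: np.ndarray, k: int = 8) -> np.ndarray:
    """Pick k frame indices whose features change most from the previous frame.

    frame_feats: (T, D) array of per-frame embeddings.
    Returns the sorted indices of the k selected frames.
    """
    # L2 distance between consecutive embeddings as a crude event score.
    diffs = np.linalg.norm(np.diff(frame_feats, axis=0), axis=1)  # (T-1,)
    scores = np.concatenate([[0.0], diffs])  # frame 0 has no predecessor
    top = np.argsort(scores)[-k:]            # k highest-change frames
    return np.sort(top)                      # keep temporal order

# Toy usage: 64 frames with 512-dim features.
feats = np.random.randn(64, 512).astype(np.float32)
print(sample_event_frames(feats, k=8))
```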
📝 Abstract
Predicting and reasoning about how a video would make a human feel is crucial for developing socially intelligent systems. Although Multimodal Large Language Models (MLLMs) have shown impressive video understanding capabilities, they tend to focus more on the semantic content of videos, often overlooking emotional stimuli. Hence, most existing MLLMs fall short in estimating viewers' emotional reactions and providing plausible explanations. To address this issue, we propose StimuVAR, a spatiotemporal Stimuli-aware framework for Video Affective Reasoning (VAR) with MLLMs. StimuVAR incorporates a two-level stimuli-aware mechanism: frame-level awareness and token-level awareness. Frame-level awareness involves sampling video frames with events that are most likely to evoke viewers' emotions. Token-level awareness performs tube selection in the token space to make the MLLM concentrate on emotion-triggered spatiotemporal regions. Furthermore, we create VAR instruction data to perform affective training, steering MLLMs' reasoning strengths towards emotional focus and thereby enhancing their affective reasoning ability. To thoroughly assess the effectiveness of VAR, we provide a comprehensive evaluation protocol with extensive metrics. StimuVAR is the first MLLM-based method for viewer-centered VAR. Experiments demonstrate its superiority in understanding viewers' emotional responses to videos and providing coherent and insightful explanations.
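The token-level awareness step can likewise be illustrated with a short sketch. This is an assumption-laden approximation rather than StimuVAR's actual tube selection: it supposes visual tokens laid out as frames x spatial patches and a precomputed per-tube relevance score (e.g., attention to an emotion-related query), neither of which is specified here; all names and shapes are hypothetical.

```python
# Hedged sketch of token-level "tube" selection: keep only the spatiotemporal
# tubes (the same spatial patch tracked across frames) with the highest
# relevance scores, so the MLLM attends to emotion-triggering regions.
import torch

def select_tubes(tokens: torch.Tensor, scores: torch.Tensor, n_keep: int) -> torch.Tensor:
    """tokens: (T, P, D) visual tokens for T frames x P spatial patches.
    scores: (P,) per-tube relevance score (assumed given).
    Returns (T, n_keep, D): tokens of the n_keep highest-scoring tubes.
    """
    keep = scores.topk(n_keep).indices  # indices of the top-scoring tubes
    return tokens[:, keep, :]           # same patch positions in every frame

# Toy usage: 8 frames, 196 patches, 768-dim tokens, random relevance scores.
toks = torch.randn(8, 196, 768)
rel = torch.rand(196)
print(select_tubes(toks, rel, n_keep=32).shape)  # torch.Size([8, 32, 768])
```

Selecting whole tubes rather than independent per-frame tokens preserves temporal correspondence, which is one plausible reason for operating at the tube level; the scoring signal itself would come from the model in practice.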