VidEmo: Affective-Tree Reasoning for Emotion-Centric Video Foundation Models

📅 2025-11-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Video affective understanding faces significant challenges due to high temporal dynamics and strong contextual dependency. To address this, we propose the first large vision-language model framework specifically designed for affective reasoning. Our method introduces three core innovations: (1) Emo-CFG, a fine-grained, human-annotated video affective dataset; (2) Affective-Tree, a novel reinforcement learning mechanism enabling multi-step, interpretable affective reasoning over spatiotemporal cues; and (3) a curriculum-based affective learning strategy that progressively injects emotion knowledge through two-stage training. The framework unifies perception-to-cognition reasoning—from low-level facial attribute detection to high-level affective interpretation—within a single end-to-end architecture. Evaluated on 15 facial perception benchmarks, it achieves state-of-the-art performance, significantly improving both accuracy and interpretability in video-based emotion recognition. This work establishes a new paradigm for affective intelligence research and provides foundational resources—including data, architecture, and training methodology—for future studies.

📝 Abstract
Understanding and predicting emotion from videos has garnered significant attention in recent studies, driven by advances in video large language models (VideoLLMs). While advanced methods have made progress in video emotion analysis, the intrinsic nature of emotions poses significant challenges: emotions are dynamic and cue-dependent, making it difficult to interpret complex, evolving emotional states with a reasonable rationale. To tackle these challenges, we propose a novel affective cues-guided reasoning framework that unifies fundamental attribute perception, expression analysis, and high-level emotional understanding in a stage-wise manner. At the core of our approach is a family of video emotion foundation models (VidEmo), specifically designed for emotion reasoning and instruction following. These models undergo a two-stage tuning process: first, curriculum emotion learning to inject emotion knowledge, followed by affective-tree reinforcement learning for emotion reasoning. Moreover, we establish a foundational data infrastructure and introduce an emotion-centric fine-grained dataset (Emo-CFG) consisting of 2.1M diverse instruction-based samples. Emo-CFG includes explainable emotional question answering, fine-grained captions, and associated rationales, providing essential resources for advancing emotion-understanding tasks. Experimental results demonstrate that our approach achieves competitive performance, setting a new milestone across 15 face perception tasks.
Problem

Research questions and friction points this paper is trying to address.

Understanding dynamic emotion states from videos with reasonable rationale
Analyzing complex evolving emotional cues that are context-dependent
Developing emotion-centric video foundation models for affective reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Affective-tree reinforcement learning for emotion reasoning
Two-stage tuning with curriculum emotion learning
Emotion-centric dataset with explainable instruction samples
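The two-stage tuning described above (curriculum emotion learning followed by affective-tree reinforcement learning) can be sketched as a staged schedule. The stage names follow the paper, but the task ordering within each stage and the hook-based structure are illustrative assumptions, not the authors' actual training code.

```python
# Hedged sketch of the two-stage tuning pipeline. Task names within each
# stage are assumptions based on the paper's perception-to-cognition framing.
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    tasks: list  # ordered from low-level perception to high-level cognition


def build_curriculum():
    """Return the two training stages in the order they are applied."""
    stage1 = Stage(
        name="curriculum emotion learning",
        tasks=[
            "facial attribute perception",   # low-level cues
            "expression analysis",           # mid-level cues
            "emotion understanding",         # high-level interpretation
        ],
    )
    stage2 = Stage(
        name="affective-tree reinforcement learning",
        tasks=["multi-step affective reasoning over spatiotemporal cues"],
    )
    return [stage1, stage2]


def run(curriculum, train_stage):
    """Apply each stage in sequence; `train_stage` is a caller-supplied hook
    standing in for the actual fine-tuning / RL step."""
    for stage in curriculum:
        for task in stage.tasks:
            train_stage(stage.name, task)


# Example: record the schedule instead of actually training.
log = []
run(build_curriculum(), lambda stage, task: log.append((stage, task)))
```

The point of the sketch is the ordering constraint: all curriculum-learning tasks complete before any reinforcement-learning task begins, mirroring the progressive injection of emotion knowledge before reasoning is trained.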