EmoVid: A Multimodal Emotion Video Dataset for Emotion-Centric Video Understanding and Generation

📅 2025-11-14
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing video generation research predominantly emphasizes low-level visual metrics, neglecting affective dimension modeling and lacking dedicated emotional video benchmarks for creative media. Method: We introduce EmoVid—the first multimodal dataset specifically designed for affective video generation—comprising animated clips, film excerpts, and GIF stickers, with fine-grained emotion annotations newly introduced to non-photorealistic video generation. We propose a spatiotemporal emotion–visual alignment model that jointly encodes emotion labels, visual attributes (brightness, saturation, hue), and textual descriptions, and perform emotion-conditioned fine-tuning on Wan2.1. Contribution/Results: Our approach significantly improves emotional consistency (+18.3% in subjective evaluation) and visual fidelity (FVD reduced by 12.7%) in text-to-video and image-to-video generation, establishing a new benchmark for emotion-driven video synthesis.

Technology Category

Application Category

📝 Abstract
Emotion plays a pivotal role in video-based expression, but existing video generation systems predominantly focus on low-level visual metrics while neglecting affective dimensions. Although emotion analysis has made progress in the visual domain, the video community lacks dedicated resources to bridge emotion understanding with generative tasks, particularly for stylized and non-realistic contexts. To address this gap, we introduce EmoVid, the first multimodal, emotion-annotated video dataset specifically designed for creative media, which includes cartoon animations, movie clips, and animated stickers. Each video is annotated with emotion labels, visual attributes (brightness, colorfulness, hue), and text captions. Through systematic analysis, we uncover spatial and temporal patterns linking visual features to emotional perceptions across diverse video forms. Building on these insights, we develop an emotion-conditioned video generation technique by fine-tuning the Wan2.1 model. The results show a significant improvement in both quantitative metrics and the visual quality of generated videos for text-to-video and image-to-video tasks. EmoVid establishes a new benchmark for affective video computing. Our work not only offers valuable insights into visual emotion analysis in artistically styled videos, but also provides practical methods for enhancing emotional expression in video generation.
Problem

Research questions and friction points this paper is trying to address.

Bridging emotion understanding with video generation tasks
Addressing neglect of affective dimensions in video systems
Developing emotion-conditioned generation for stylized video contexts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces first multimodal emotion-annotated video dataset
Develops emotion-conditioned generation via fine-tuned Wan2.1 model
Links visual features to emotions across diverse video forms
🔎 Similar Papers
No similar papers found.
Z
Zongyang Qiu
The Hong Kong University of Science and Technology (Guangzhou), China
Bingyuan Wang
Bingyuan Wang
The Hong Kong University of Science and Technology (Guangzhou)
Generative AIAffective ComputingImmersive StorytellingCreative IntelligenceCultural Heritage
X
Xingbei Chen
The Hong Kong University of Science and Technology (Guangzhou), China
Yingqing He
Yingqing He
HKUST
Video GenerationAIGCPost-TrainingRLHFGenerative Model
Z
Zeyu Wang
The Hong Kong University of Science and Technology, Hong Kong SAR, China