🤖 AI Summary
This study addresses long-term affective modeling in de-identified videos, tackling the privacy leakage and transient-emotion interference caused by existing methods' reliance on sensitive modalities (e.g., facial expressions or speech). To this end, the authors introduce EALD, the first benchmark dataset built from post-match athlete interviews, and propose Non-Facial Body Language (NFBL) as an inner-driven, privacy-preserving cue for long-sequence affective analysis. Methodologically, they evaluate Multimodal Large Language Models (MLLMs) on joint reasoning over de-identified visual, speech, and NFBL signals. Experiments demonstrate that MLLMs achieve zero-shot performance comparable to, or better than, supervised single-modal baselines, and that NFBL is an important cue for emotion analysis in long videos. Both the EALD dataset and baseline models are publicly released.
📝 Abstract
Emotion AI refers to the ability of computers to understand human emotional states. Existing works have achieved promising progress, but two limitations remain: 1) previous studies have focused more on emotion analysis in short sequential videos while overlooking long sequential ones. However, emotions in short sequential videos reflect only instantaneous states, which may be deliberately guided or hidden, whereas long sequential videos can reveal authentic emotions; 2) previous studies commonly rely on signals such as facial expressions, speech, and even sensitive biological signals (e.g., electrocardiogram). Given the increasing demand for privacy, developing Emotion AI that does not depend on sensitive signals is becoming important. To address these limitations, we construct a dataset for Emotion Analysis in Long-sequential and De-identity videos, called EALD, by collecting and processing athletes' post-match interviews. In addition to annotating the overall emotional state of each video, we provide Non-Facial Body Language (NFBL) annotations for each player. NFBL is an inner-driven emotional expression and can serve as an identity-free cue for understanding emotional states. Moreover, we provide a simple but effective baseline for further research: we evaluate Multimodal Large Language Models (MLLMs) with de-identified signals (e.g., visual, speech, and NFBL) for emotion analysis. Our experimental results demonstrate that: 1) MLLMs can achieve performance comparable to, or even better than, supervised single-modal models, even in a zero-shot scenario; 2) NFBL is an important cue in long sequential emotion analysis. EALD will be available on an open-source platform.