🤖 AI Summary
This study addresses long-term affective modeling in de-identified videos, tackling the privacy leakage and transient-emotion interference caused by existing methods' reliance on sensitive modalities (e.g., facial expressions or speech). To this end, the authors introduce EALD, the first benchmark dataset built from post-match athlete interviews, and propose Non-Facial Body Language (NFBL) as an inner-driven, privacy-preserving cue for long-sequence affective analysis. Methodologically, they evaluate Multimodal Large Language Models (MLLMs) on joint reasoning over de-identified visual, speech, and NFBL signals. Experiments demonstrate that MLLMs achieve zero-shot performance comparable to, or better than, supervised single-modal baselines, and that NFBL is an important cue for emotion analysis in long videos. Both the EALD dataset and baseline models are publicly released.
📝 Abstract
Emotion AI refers to the ability of computers to understand human emotional states. Existing works have achieved promising progress, but two limitations remain: 1) previous studies have focused more on emotion analysis in short sequential videos while overlooking long sequential ones. However, emotions in short sequential videos reflect only instantaneous states, which may be deliberately guided or hidden, whereas long sequential videos can reveal authentic emotions; 2) previous studies commonly rely on signals such as facial expressions, speech, and even sensitive biological signals (e.g., electrocardiogram). Given the increasing demand for privacy, developing Emotion AI that does not depend on sensitive signals is becoming important. To address these limitations, we construct a dataset for Emotion Analysis in Long-sequential and De-identity videos, called EALD, by collecting and processing athletes' post-match interviews. In addition to annotating the overall emotional state of each video, we provide Non-Facial Body Language (NFBL) annotations for each player. NFBL is an inner-driven emotional expression and can serve as an identity-free cue for understanding emotional states. Moreover, we provide a simple but effective baseline for further research: we evaluate Multimodal Large Language Models (MLLMs) with de-identified signals (e.g., visual, speech, and NFBL) for emotion analysis. Our experimental results demonstrate that: 1) MLLMs can achieve performance comparable to, or even better than, supervised single-modal models, even in a zero-shot scenario; 2) NFBL is an important cue in long sequential emotion analysis. EALD will be available on an open-source platform.