DEEMO: De-identity Multimodal Emotion Recognition and Reasoning

📅 2025-04-28
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Conventional multimodal emotion recognition relies on identity-sensitive cues (e.g., facial appearance and voiceprints), posing significant privacy risks. Method: This paper proposes a de-identified multimodal emotion recognition and reasoning paradigm. It introduces the first benchmark dataset for this setting, comprising DEEMO-NFBL, with rich Non-Facial Body Language (NFBL) annotations, and DEEMO-MER, an instruction dataset for recognition and reasoning from identity-free cues, and designs DEEMO-LLaMA, a unified architecture integrating de-identified audio, video, and textual information. Crucially, it achieves identity-agnostic emotion understanding without facial or vocal biometric cues. Contribution/Results: The method attains 74.49% accuracy (F1 = 74.45%) on de-identity emotion recognition; on the reasoning task, it achieves clue-overlap and label-overlap scores of 6.20 and 7.66, respectively, substantially outperforming existing Multimodal Large Language Models (MLLMs). This work establishes a foundational framework for privacy-preserving, trustworthy affective computing in sensitive applications.

📝 Abstract
Emotion understanding is a critical yet challenging task. Most existing approaches rely heavily on identity-sensitive information, such as facial expressions and speech, which raises concerns about personal privacy. To address this, we introduce De-identity Multimodal Emotion Recognition and Reasoning (DEEMO), a novel task designed to enable emotion understanding using de-identified video and audio inputs. The DEEMO dataset consists of two subsets: DEEMO-NFBL, which includes rich annotations of Non-Facial Body Language (NFBL), and DEEMO-MER, an instruction dataset for Multimodal Emotion Recognition and Reasoning using identity-free cues. This design supports emotion understanding without compromising identity privacy. In addition, we propose DEEMO-LLaMA, a Multimodal Large Language Model (MLLM) that integrates de-identified audio, video, and textual information to enhance both emotion recognition and reasoning. Extensive experiments show that DEEMO-LLaMA achieves state-of-the-art performance on both tasks, outperforming existing MLLMs by a significant margin: 74.49% accuracy and a 74.45% F1-score in de-identity emotion recognition, and clue-overlap and label-overlap scores of 6.20 and 7.66 in de-identity emotion reasoning. Our work contributes to ethical AI by advancing privacy-preserving emotion understanding and promoting responsible affective computing.
Problem

Research questions and friction points this paper is trying to address.

Enables emotion recognition without identity-sensitive data
Addresses privacy concerns in multimodal emotion analysis
Improves accuracy in de-identified emotion understanding tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

De-identified multimodal inputs for emotion recognition
Non-Facial Body Language annotations for privacy
Multimodal Large Language Model integrating de-identified data
Authors
Deng Li
Lappeenranta-Lahti University of Technology LUT, Lappeenranta, Finland
Bohao Xing
Lappeenranta-Lahti University of Technology LUT
Emotion AI
Xin Liu
Lappeenranta-Lahti University of Technology LUT, Lappeenranta, Finland
Baiqiang Xia
Silo AI, Helsinki, Finland
Bihan Wen
Associate Professor, Nanyang Technological University
Machine Learning, Image Processing, Computational Imaging, Computer Vision, Trustworthy AI
Heikki Kalviainen
Lappeenranta-Lahti University of Technology LUT, Lappeenranta, Finland; Brno University of Technology, Brno, Czech Republic