Milmer: a Framework for Multiple Instance Learning based Multimodal Emotion Recognition

📅 2025-02-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the insufficient accuracy of emotion recognition for middle-school students in human-computer interaction, this paper proposes an end-to-end multimodal framework integrating facial expression videos and EEG signals. Methodologically: (1) it introduces multiple instance learning (MIL) into multimodal emotion recognition for the first time to model facial temporal dynamics; (2) it designs a cross-modal cross-attention mechanism enabling adaptive fusion of visual and physiological features; and (3) it combines a fine-tuned Swin Transformer with time-frequency preprocessing of EEG signals. Evaluated on the DEAP dataset, the framework achieves 96.72% accuracy in four-class emotion classification, significantly outperforming state-of-the-art methods. Ablation studies confirm the critical contributions of MIL-based temporal modeling and the synergistic fusion of time-frequency and semantic features. This work establishes a novel paradigm for robust, interpretable emotion recognition tailored to educational settings.
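In MIL terms, the facial frames sampled from one trial form a "bag" of instances, and the model learns which frames matter for the bag-level emotion label. The paper's exact pooling is not reproduced here; the following is a minimal attention-pooling sketch of the general MIL idea, with all dimensions, the scoring vector `w`, and the function name chosen for illustration only.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def mil_attention_pool(frames, w):
    """Aggregate a bag of frame embeddings into one bag-level embedding.

    frames: (num_frames, dim) embeddings of facial images from one trial.
    w:      (dim,) scoring vector (learned in practice; random here).
    """
    scores = frames @ w           # one relevance score per frame instance
    alpha = softmax(scores)       # normalized attention over instances
    return alpha @ frames, alpha  # weighted bag embedding, per-frame weights

rng = np.random.default_rng(0)
frames = rng.normal(size=(8, 16))  # 8 facial frames, 16-dim features (toy sizes)
w = rng.normal(size=16)
bag, alpha = mil_attention_pool(frames, w)
```

The per-frame weights `alpha` are what makes this style of pooling interpretable: frames the model deems uninformative receive near-zero weight instead of diluting the bag embedding, which is the property the summary attributes to MIL-based temporal modeling.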

📝 Abstract
Emotions play a crucial role in human behavior and decision-making, making emotion recognition a key area of interest in human-computer interaction (HCI). This study addresses the challenges of emotion recognition by integrating facial expression analysis with electroencephalogram (EEG) signals, introducing a novel multimodal framework, Milmer. The proposed framework employs a transformer-based fusion approach to effectively integrate visual and physiological modalities. It consists of an EEG preprocessing module, a facial feature extraction and balancing module, and a cross-modal fusion module. To enhance visual feature extraction, we fine-tune a pre-trained Swin Transformer on emotion-related datasets. Additionally, a cross-attention mechanism is introduced to balance token representation across modalities, ensuring effective feature integration. A key innovation of this work is the adoption of a multiple instance learning (MIL) approach, which extracts meaningful information from multiple facial expression images over time, capturing critical temporal dynamics often overlooked in previous studies. Extensive experiments conducted on the DEAP dataset demonstrate the superiority of the proposed framework, achieving a classification accuracy of 96.72% in the four-class emotion recognition task. Ablation studies further validate the contributions of each module, highlighting the significance of advanced feature extraction and fusion strategies in enhancing emotion recognition performance. Our code is available at https://github.com/liangyubuaa/Milmer.
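The cross-attention described in the abstract lets tokens of one modality query tokens of the other, so each EEG token is re-expressed as a mixture of facial tokens. The sketch below shows only the generic single-head mechanism, not the paper's architecture; the projection matrices, token counts, and 16-dim feature size are toy assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, wq, wk, wv):
    """Single-head cross-attention: one modality attends to the other.

    queries: (Tq, d) tokens of the querying modality (e.g. EEG).
    context: (Tc, d) tokens of the other modality (e.g. facial features).
    """
    q = queries @ wq
    k = context @ wk
    v = context @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])  # scaled dot-product scores
    attn = softmax(scores, axis=-1)          # each query row sums to 1
    return attn @ v                          # queries enriched with context

rng = np.random.default_rng(1)
d = 16
eeg_tokens = rng.normal(size=(10, d))   # toy EEG token sequence
face_tokens = rng.normal(size=(6, d))   # toy facial token sequence
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))
fused = cross_attention(eeg_tokens, face_tokens, wq, wk, wv)
```

Because the output has one row per query token, fusion in this style preserves the querying modality's sequence length while injecting information from the other modality, which is what allows the two token streams to be balanced before joint classification.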
Problem

Research questions and friction points this paper is trying to address.

Emotion Recognition
Facial Expressions
EEG Signals
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Fusion
Swin Transformer
Balancing Mechanism
👥 Authors
Zaitian Wang
Computer Network Information Center, Chinese Academy of Sciences
Data-centric AI · Large Language Models
Jian He
Beijing University of Technology, Beijing, 100124, China
Yu Liang
Beijing University of Technology, Beijing, 100124, China
Xiyuan Hu
Beijing University of Technology, Beijing, 100124, China
Tianhao Peng
Beihang University, Beijing, 100191, China
Kaixin Wang
Beijing University of Technology, Beijing, 100124, China
Jiakai Wang
Zhongguancun Laboratory
Adversarial Examples · Trustworthy AI
Chenlong Zhang
Institute of Automation, Chinese Academy of Sciences
Natural Language Processing · Large Language Models
Weili Zhang
Beijing University of Technology, Beijing, 100124, China
Shuang Niu
Beijing University of Technology, Beijing, 100124, China
Xiaoyang Xie
Beijing University of Technology, Beijing, 100124, China