Hierarchical MoE: Continuous Multimodal Emotion Recognition with Incomplete and Asynchronous Inputs

📅 2025-08-04
📈 Citations: 0 · Influential: 0
🤖 AI Summary
To address the challenges of dynamically missing modalities and asynchronous multimodal inputs in real-world Multimodal Emotion Recognition (MER), this paper proposes a robust framework for continuous emotion prediction. Methodologically, it introduces a two-tier Mixture-of-Experts (MoE) architecture: a bottom-tier, modality-specific expert pool with a soft routing mechanism adaptively handles missing modalities, while a top-tier, emotion-specific expert pool employs differential attention to dynamically focus on emotion prototypes and integrates a cross-modal alignment module to mitigate temporal misalignment and semantic inconsistency. This design enables dynamic, fine-grained cross-modal fusion and discriminative emotion representation. Evaluated on the DEAP and DREAMER benchmarks, the framework achieves state-of-the-art performance, demonstrating significantly enhanced robustness and generalization under challenging conditions, including arbitrary modality dropout and asynchronous sampling, while preserving temporal coherence and semantic fidelity across modalities.
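As a rough illustration of the bottom-tier routing idea, the PyTorch sketch below masks the routing logits of absent modalities so the softmax redistributes weight over the experts that actually received input. The module name, dimensions, pooling, and masked-softmax formulation are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ModalityExpertBank(nn.Module):
    """Hypothetical sketch of a modality expert pool with soft routing."""
    def __init__(self, dim: int, n_modalities: int = 3):
        super().__init__()
        # One lightweight expert per modality stream (e.g., EEG, ECG, peripheral signals).
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(n_modalities)
        )
        self.router = nn.Linear(dim, n_modalities)  # produces soft routing logits

    def forward(self, feats: torch.Tensor, present: torch.Tensor) -> torch.Tensor:
        # feats:   (batch, n_modalities, dim), zero-filled where a modality is missing
        # present: (batch, n_modalities), 1.0 if the modality is observed, else 0.0
        # Assumes at least one modality is present per sample.
        # Pool only over observed modalities to condition the router.
        pooled = (feats * present.unsqueeze(-1)).sum(1) / present.sum(1, keepdim=True).clamp(min=1.0)
        logits = self.router(pooled)
        # Mask absent modalities so softmax weight redistributes over live experts.
        logits = logits.masked_fill(present == 0, float("-inf"))
        weights = torch.softmax(logits, dim=-1)  # (batch, n_modalities)
        expert_out = torch.stack(
            [expert(feats[:, i]) for i, expert in enumerate(self.experts)], dim=1
        )
        return (weights.unsqueeze(-1) * expert_out).sum(dim=1)  # fused representation

# Usage: batch of 2 samples, 3 modalities; the second sample is missing modality 2.
bank = ModalityExpertBank(dim=64)
x = torch.randn(2, 3, 64)
mask = torch.tensor([[1.0, 1.0, 1.0], [1.0, 1.0, 0.0]])
fused = bank(x, mask)  # (2, 64)
```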

📝 Abstract
Multimodal emotion recognition (MER) is crucial for human-computer interaction, yet real-world challenges like dynamic modality incompleteness and asynchrony severely limit its robustness. Existing methods often assume consistently complete data or lack dynamic adaptability. To address these limitations, we propose a novel Hi-MoE (Hierarchical Mixture-of-Experts) framework for robust continuous emotion prediction. This framework employs a dual-layer expert structure. A Modality Expert Bank utilizes soft routing to dynamically handle missing modalities and achieve robust information fusion. A subsequent Emotion Expert Bank leverages differential-attention routing to flexibly attend to emotional prototypes, enabling fine-grained emotion representation. Additionally, a cross-modal alignment module explicitly addresses temporal shifts and semantic inconsistencies between modalities. Extensive experiments on benchmark datasets DEAP and DREAMER demonstrate our model's state-of-the-art performance in continuous emotion regression, showcasing exceptional robustness under challenging conditions such as dynamic modality absence and asynchronous sampling. This research significantly advances the development of intelligent emotion systems adaptable to complex real-world environments.
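To make the cross-modal alignment idea concrete, here is a minimal, hypothetical sketch in which a reference modality's sequence attends over an asynchronously sampled modality via cross-attention, compensating for temporal shifts without hard resampling. The single-block design and layer choices are assumptions, not the paper's stated module.

```python
import torch
import torch.nn as nn

class CrossModalAlign(nn.Module):
    """Aligns an asynchronously sampled modality onto a reference timeline."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, ref: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        # ref:   (batch, T_ref, dim) reference modality at its own sampling rate
        # other: (batch, T_other, dim) modality sampled at a different rate/offset
        aligned, _ = self.attn(ref, other, other)  # queries follow the reference timeline
        # Residual connection keeps the reference stream's temporal coherence intact.
        return self.norm(ref + aligned)

# Usage: align a 96-step signal onto a 128-step reference timeline.
align = CrossModalAlign(dim=64)
out = align(torch.randn(2, 128, 64), torch.randn(2, 96, 64))  # (2, 128, 64)
```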
Problem

Research questions and friction points this paper is trying to address.

Dynamic modality incompleteness (arbitrary dropout) in real-world MER
Temporal asynchrony across multimodal input streams
Existing methods assume consistently complete data, limiting the robustness of continuous emotion prediction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical dual-layer MoE framework for robust continuous emotion prediction
Soft routing over a Modality Expert Bank dynamically handles missing modalities
Differential-attention routing over emotion prototypes for fine-grained emotion representation (sketched below)
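The differential-attention routing idea can be sketched as a difference of two softmax attention maps over learnable emotion prototypes, following the general differential-attention notion of subtracting a second map to cancel common-mode noise. The prototype count, the learnable subtraction weight, and the normalization below are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class DifferentialPrototypeRouter(nn.Module):
    """Hypothetical differential-attention routing over emotion prototypes."""
    def __init__(self, dim: int, n_prototypes: int = 4):
        super().__init__()
        # Learnable emotion prototypes (e.g., quadrants of valence-arousal space).
        self.prototypes = nn.Parameter(torch.randn(n_prototypes, dim))
        self.q1 = nn.Linear(dim, dim)
        self.q2 = nn.Linear(dim, dim)
        self.lam = nn.Parameter(torch.tensor(0.5))  # learnable subtraction weight

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        # fused: (batch, dim), e.g., the output of the modality-level fusion
        scale = fused.size(-1) ** -0.5
        attn1 = torch.softmax(self.q1(fused) @ self.prototypes.t() * scale, dim=-1)
        attn2 = torch.softmax(self.q2(fused) @ self.prototypes.t() * scale, dim=-1)
        # Differential attention: the second map acts as a noise estimate, so
        # subtracting it suppresses prototypes attended to indiscriminately.
        weights = attn1 - self.lam * attn2  # (batch, n_prototypes), may be signed
        return weights / weights.abs().sum(-1, keepdim=True).clamp(min=1e-6)

# Usage: routing weights over 4 emotion prototypes for a batch of fused features.
router = DifferentialPrototypeRouter(dim=64)
w = router(torch.randn(2, 64))  # (2, 4)
```

Each routing weight would then scale the corresponding emotion expert's output before aggregation, mirroring the soft combination used at the modality tier.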
🔎 Similar Papers
No similar papers found.
Authors

Yitong Zhu
The Hong Kong University of Science and Technology (Guangzhou)
human factors engineering · multi-modal learning · affective computing

Lei Han
The Hong Kong University of Science and Technology (Guangzhou)

GuanXuan Jiang
The Hong Kong University of Science and Technology (Guangzhou)

PengYuan Zhou
Aarhus University

Yuyang Wang
The Hong Kong University of Science and Technology (Guangzhou)