Multi-modal Collaborative Optimization and Expansion Network for Event-assisted Single-eye Expression Recognition

📅 2025-05-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the poor robustness of single-eye facial expression recognition under low-light, overexposed, and high-dynamic-range (HDR) conditions, this paper proposes a multimodal collaborative recognition framework, the MCO-E Net, that integrates event cameras and conventional frame-based images. The method introduces two key innovations: (1) the Multi-modal Collaborative Optimization Mamba (MCO-Mamba), a Mamba-based architecture for joint temporal-semantic modeling across both modalities; and (2) the Heterogeneous Collaborative and Expansion Mixture-of-Experts (HCE-MoE), which unifies structurally varied experts (deep, attention, and focal) via dynamic routing and cross-modal feature fusion between event streams and intensity frames. Extensive experiments show competitive recognition accuracy and generalization under challenging illumination conditions, particularly in poor lighting.

📝 Abstract
In this paper, we propose a Multi-modal Collaborative Optimization and Expansion Network (MCO-E Net) that uses event modalities to resist challenges such as low light, high exposure, and high dynamic range in single-eye expression recognition tasks. The MCO-E Net introduces two innovative designs: Multi-modal Collaborative Optimization Mamba (MCO-Mamba) and Heterogeneous Collaborative and Expansion Mixture-of-Experts (HCE-MoE). MCO-Mamba, building upon Mamba, leverages dual-modal information to jointly optimize the model, facilitating collaborative interaction and fusion of modal semantics. This approach encourages the model to balance the learning of both modalities and to harness their respective strengths. HCE-MoE, on the other hand, employs a dynamic routing mechanism to distribute structurally varied experts (deep, attention, and focal), fostering collaborative learning of complementary semantics. This heterogeneous architecture systematically integrates diverse feature extraction paradigms to comprehensively capture expression semantics. Extensive experiments demonstrate that our proposed network achieves competitive performance in the task of single-eye expression recognition, especially under poor lighting conditions.
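The HCE-MoE described above routes inputs through structurally different experts via a learned gate. A minimal, purely illustrative sketch of softmax-gated routing over heterogeneous experts follows; all dimensions, weight matrices, and expert designs here are hypothetical stand-ins, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # hypothetical dimension of the fused event+frame feature

# Three structurally different experts: stand-ins for the paper's
# deep, attention, and focal experts (the real designs are not given here).
W_deep1 = rng.standard_normal((DIM, DIM)) * 0.1
W_deep2 = rng.standard_normal((DIM, DIM)) * 0.1
W_attn = rng.standard_normal((DIM, DIM)) * 0.1
w_focal = rng.standard_normal(DIM)

def deep_expert(x):
    # Stacked linear layers with a ReLU in between.
    return np.maximum(x @ W_deep1, 0.0) @ W_deep2

def attention_expert(x):
    # Toy self-gating: re-weight features by a normalized score.
    scores = np.exp(x @ W_attn)
    return x * (scores / scores.sum())

def focal_expert(x):
    # Toy focal-style re-weighting of individual feature channels.
    return x * np.tanh(w_focal)

experts = [deep_expert, attention_expert, focal_expert]
W_gate = rng.standard_normal((DIM, len(experts))) * 0.1

def hce_moe(x):
    """Dynamically route x: softmax gate weights mix all expert outputs."""
    logits = x @ W_gate
    gates = np.exp(logits - logits.max())
    gates /= gates.sum()
    return sum(g * f(x) for g, f in zip(gates, experts))

x = rng.standard_normal(DIM)
y = hce_moe(x)
```

Here the gate mixes every expert; a real MoE would typically keep only the top-k gates per input to stay sparse.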
Problem

Research questions and friction points this paper is trying to address.

Resist low light, high exposure in single-eye expression recognition
Balance dual-modal learning for collaborative semantic fusion
Integrate diverse feature extraction for comprehensive expression capture
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-modal Collaborative Optimization Mamba for dual-modal fusion
Heterogeneous Collaborative and Expansion Mixture-of-Experts for dynamic routing
Event modalities to resist low light and high dynamic range
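MCO-Mamba's joint optimization of the two modalities can be caricatured as two coupled recurrent (state-space style) scans that exchange hidden state at every step, so each stream's temporal model is nudged by the other. This is a toy sketch under assumed shapes and a made-up coupling matrix, not the paper's actual selective-scan design:

```python
import numpy as np

rng = np.random.default_rng(1)
D, T = 4, 6  # hypothetical feature size and sequence length

# Toy diagonal state decays, one per modality (assumed values).
a_evt = 0.9 * np.ones(D)
a_img = 0.8 * np.ones(D)
B_mix = rng.standard_normal((D, D)) * 0.1  # cross-modal coupling (assumed)

def mco_scan(evt_seq, img_seq):
    """Jointly scan both modality streams, exchanging state each step."""
    h_e = np.zeros(D)
    h_i = np.zeros(D)
    fused = []
    for e_t, i_t in zip(evt_seq, img_seq):
        h_e = a_evt * h_e + e_t + h_i @ B_mix  # event state, nudged by image state
        h_i = a_img * h_i + i_t + h_e @ B_mix  # image state, nudged by event state
        fused.append(np.concatenate([h_e, h_i]))
    return np.stack(fused)

evt = rng.standard_normal((T, D))
img = rng.standard_normal((T, D))
out = mco_scan(evt, img)
```

The point of the coupling term is that neither modality is optimized in isolation, which mirrors the abstract's claim of balanced dual-modal learning.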
Runduo Han (Dalian University of Technology)
Xiuping Liu
Shangxuan Yi
Yi Zhang
Hongchen Tan