Multi-Track Multimodal Learning on iMiGUE: Micro-Gesture and Emotion Recognition

📅 2025-12-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two fine-grained behavioral understanding challenges: micro-gesture recognition and behavior-driven emotion prediction. We propose a dual-track multimodal framework: one track fuses RGB video with 3D skeletal pose representations, while the other jointly models facial and contextual visual features. We design Cross-Modal Token Fusion and InterFusion modules to capture cross-modal spatiotemporal dependencies between micro-gestures and emotions, the first such unified modeling on the iMiGUE dataset. Multi-source features are extracted with MViTv2-S (RGB), 2s-AGCN (skeletons), and SwinFace (facial cues) and integrated via the proposed fusion mechanisms under an end-to-end joint training paradigm. Our method ranked second in the Emotion Prediction subtask of the MiGA 2025 Challenge, improving robustness in micro-behavior recognition by 4.2% and emotion classification accuracy by 3.8%.
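The Cross-Modal Token Fusion module is only described at a high level here. Below is a minimal PyTorch sketch of one plausible realization, assuming bidirectional cross-attention between RGB tokens (e.g., from MViTv2-S) and skeleton tokens (e.g., from 2s-AGCN) projected to a shared width; the class name, shapes, and hyperparameters are illustrative assumptions, not the authors' code.

```python
# Hypothetical sketch of a Cross-Modal Token Fusion module; the paper
# does not publish its internals, so this is one plausible realization
# (bidirectional cross-attention), not the authors' implementation.
import torch
import torch.nn as nn


class CrossModalTokenFusion(nn.Module):
    """Fuse RGB video tokens with 3D-pose tokens via cross-attention.

    Assumed shapes: rgb_tokens (B, N_r, D) from an MViTv2-S backbone,
    pose_tokens (B, N_p, D) from a 2s-AGCN backbone, both already
    projected to a shared embedding size D.
    """

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.norm_rgb = nn.LayerNorm(dim)
        self.norm_pose = nn.LayerNorm(dim)
        self.rgb_to_pose = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.pose_to_rgb = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.LayerNorm(2 * dim), nn.Linear(2 * dim, dim), nn.GELU()
        )

    def forward(self, rgb_tokens, pose_tokens):
        r, p = self.norm_rgb(rgb_tokens), self.norm_pose(pose_tokens)
        # Residual cross-attention in both directions: each stream
        # queries the other for complementary spatio-temporal cues.
        rgb_fused = rgb_tokens + self.rgb_to_pose(r, p, p, need_weights=False)[0]
        pose_fused = pose_tokens + self.pose_to_rgb(p, r, r, need_weights=False)[0]
        # Mean-pool each stream, concatenate, and project to a joint embedding.
        joint = torch.cat([rgb_fused.mean(dim=1), pose_fused.mean(dim=1)], dim=-1)
        return self.mlp(joint)  # (B, dim) fused representation
```

A classification head on top of this fused embedding would then predict the micro-gesture class.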

📝 Abstract
Micro-gesture recognition and behavior-based emotion prediction are both highly challenging tasks that require modeling subtle, fine-grained human behaviors, primarily from video and skeletal pose data. In this work, we present two multimodal frameworks designed to tackle both problems on the iMiGUE dataset. For micro-gesture classification, we exploit the complementary strengths of RGB and 3D pose-based representations to capture nuanced spatio-temporal patterns. To represent gestures comprehensively, video and skeletal embeddings are extracted using MViTv2-S and 2s-AGCN, respectively, and integrated through a Cross-Modal Token Fusion module that combines appearance and pose information. For emotion recognition, our framework extends to behavior-based emotion prediction, a binary classification task that identifies emotional states from visual cues. We leverage facial and contextual embeddings extracted with SwinFace and MViTv2-S and fuse them through an InterFusion module designed to jointly capture emotional expressions and body gestures. Experiments on the iMiGUE dataset, conducted within the scope of the MiGA 2025 Challenge, demonstrate the robustness and accuracy of our method on the behavior-based emotion prediction task, where our approach secured second place.
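The InterFusion module is likewise described only at a high level. A minimal sketch follows, assuming channel-wise gated fusion of a SwinFace facial embedding with an MViTv2-S contextual embedding feeding a binary emotion head; `InterFusion`, the embedding sizes, and the gating scheme are assumptions for illustration.

```python
# Hypothetical sketch of an InterFusion module for binary,
# behavior-based emotion prediction. Embedding sizes and the gating
# scheme are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn


class InterFusion(nn.Module):
    def __init__(self, face_dim: int = 512, ctx_dim: int = 768, dim: int = 512):
        super().__init__()
        self.face_proj = nn.Linear(face_dim, dim)   # SwinFace embedding
        self.ctx_proj = nn.Linear(ctx_dim, dim)     # MViTv2-S context embedding
        # Gate decides, per feature channel, how much to trust the facial
        # stream versus the body/context stream.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.head = nn.Linear(dim, 2)               # two emotional states

    def forward(self, face_emb, ctx_emb):
        f = self.face_proj(face_emb)    # (B, dim)
        c = self.ctx_proj(ctx_emb)      # (B, dim)
        g = self.gate(torch.cat([f, c], dim=-1))
        fused = g * f + (1.0 - g) * c   # convex, channel-wise mixture
        return self.head(fused)         # (B, 2) logits
```

The gated mixture lets the model lean on whichever stream is more informative for a given clip, rather than weighting face and context equally.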
Problem

Research questions and friction points this paper is trying to address.

Recognizing subtle micro-gestures from video and pose data
Predicting emotional states based on visual behavioral cues
Integrating multimodal features for spatio-temporal pattern analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fuses RGB and 3D pose data for micro-gesture recognition
Integrates facial and contextual embeddings for emotion prediction
Employs cross-modal fusion modules to combine different data modalities (see the joint-training sketch below)
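To make the end-to-end joint training paradigm concrete, here is a hedged sketch of a single training step wiring together the two hypothetical modules from the earlier sketches; the gesture class count, loss weighting, and optimizer settings are illustrative assumptions, not taken from the paper.

```python
# Illustrative joint training step for the two tracks. Backbone feature
# extraction (MViTv2-S, 2s-AGCN, SwinFace) is stubbed with random
# tensors; reuses the hypothetical CrossModalTokenFusion and
# InterFusion classes sketched above.
import torch
import torch.nn.functional as F

fusion = CrossModalTokenFusion(dim=512)           # micro-gesture track
inter = InterFusion(face_dim=512, ctx_dim=768)    # emotion track
gesture_head = torch.nn.Linear(512, 32)           # 32 gesture classes assumed
params = [*fusion.parameters(), *inter.parameters(), *gesture_head.parameters()]
opt = torch.optim.AdamW(params, lr=1e-4)

# Stand-ins for backbone outputs on one batch of eight clips.
rgb_tokens = torch.randn(8, 196, 512)     # MViTv2-S video tokens
pose_tokens = torch.randn(8, 64, 512)     # 2s-AGCN skeleton tokens
face_emb = torch.randn(8, 512)            # SwinFace facial embedding
ctx_emb = torch.randn(8, 768)             # contextual embedding
gesture_y = torch.randint(0, 32, (8,))
emotion_y = torch.randint(0, 2, (8,))

opt.zero_grad()
gesture_logits = gesture_head(fusion(rgb_tokens, pose_tokens))
emotion_logits = inter(face_emb, ctx_emb)
# Equal loss weights are an assumption; the paper does not specify them.
loss = (F.cross_entropy(gesture_logits, gesture_y)
        + F.cross_entropy(emotion_logits, emotion_y))
loss.backward()
opt.step()
```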
Arman Martirosyan
Russian-Armenian University, Yerevan, Armenia
Shahane Tigranyan
Russian-Armenian University, Yerevan, Armenia
Maria Razzhivina
ISP RAS, Moscow, Russia
Artak Aslanyan
HSE University, Moscow, Russia
Nazgul Salikhova
Innopolis University, Innopolis, Russia
Ilya Makarov
Principal AI Researcher
Artificial Intelligence · Computer Vision · Network Science · Game Design · Augmented Reality
Andrey Savchenko
Sber AI Lab; HSE University - Nizhny Novgorod
Computer Vision · Pattern Recognition · Machine Learning · Speech Processing · Image Processing
Aram Avetisyan
ISP RAS, Moscow, Russia; ISP RAS Research Center for Trusted Artificial Intelligence, Moscow, Russia