🤖 AI Summary
This study addresses the lack of large-scale micro-gesture datasets that capture unconscious, emotion-reflective movements in real-world settings. To bridge this gap, the authors introduce iMiGUE-3K, the largest in-the-wild micro-gesture video dataset to date, comprising 32 distinct micro-gesture categories and 37 million frames of real-world footage. They also propose MG-FMs, the first transferable discriminative foundation model tailored for micro-gestures, which integrates self-supervised pretraining with transfer learning. Evaluated across five standardized downstream tasks, MG-FMs demonstrates substantial improvements in emotion recognition performance. The work provides a benchmark for affective computing and human-computer interaction research.
📝 Abstract
Emotion understanding is a fundamental challenge in affective computing and artificial intelligence. While existing approaches predominantly focus on facial expressions and speech, they often overlook the rich emotional cues conveyed through body language. Recently, micro-gestures (MGs), which are unintentional, subconscious movements driven by inner feelings, have attracted increasing attention as an alternative to other cues. However, no existing large-scale dataset supports the pre-training of MG foundation models. To advance MG research, we present a new benchmark for micro-gesture-based emotion understanding, whose key contributions are a novel dataset (iMiGUE-3K) and a series of foundation models for different tasks. Using a model-based crowd-sourcing data collection strategy, we construct iMiGUE-3K, the largest MG dataset to date. It comprises video recordings of public press interviews given by 332 distinct professional tennis players over the past seven years, totaling more than 3.4K long video clips and 37 million frames. The dataset includes 32 micro-gesture classes with rich descriptive annotations, making it the first large-scale, in-the-wild video dataset for fine-grained gesture-based emotion analysis. Built on iMiGUE-3K, we propose MG-FMs, a discriminative foundation model for transferable gesture representation learning. Based on the foundation model, we establish five comprehensive evaluation tasks: MG recognition (unsupervised, semi-supervised, and supervised), MG retrieval, and MG emotion recognition. Our systematic evaluation of representative methods demonstrates that micro-gesture-based analysis significantly improves emotion understanding. We hope this work provides comprehensive tools for MG analysis and lays a solid foundation for future research in psychological diagnostics, affective computing, and advanced human-computer interaction.