🤖 AI Summary
Prior work on micro-gesture emotion recognition remains limited, particularly in modeling fine-grained affective dynamics from skeletal sequences.
Method: This paper proposes a hypergraph-enhanced Transformer framework, the first to apply hypergraph modeling to skeleton-based micro-gesture emotion analysis. A hypergraph self-attention module with progressively updated hyperedges explicitly captures high-order, time-varying joint interactions; multi-scale temporal convolutions and a self-supervised reconstruction decoder encode the subtle motion patterns of micro-gestures; and the emotion classification head, attached to the encoder, is jointly optimized end to end with the reconstruction task.
Results: Evaluated on the iMiGUE and SMG benchmarks, the method achieves state-of-the-art performance, outperforming existing approaches in accuracy, macro-F1, and other key metrics, which demonstrates the efficacy of hypergraph structures for modeling micro-gesture-level emotional states.
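The hybrid-supervised objective described above combines a supervised classification loss with a self-supervised reconstruction loss. A minimal pure-Python sketch of such a joint loss is shown below; the specific loss terms (cross-entropy plus mean-squared reconstruction error) and the weighting factor `lam` are our assumptions, not details confirmed by the paper.

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of logits
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(logits, label):
    # supervised loss for the emotion classification head
    return -math.log(softmax(logits)[label])

def mse(pred, target):
    # self-supervised loss for reconstructing the skeleton sequence
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def joint_loss(logits, label, recon, target, lam=1.0):
    # one-stage training: classification + lam * reconstruction
    # (lam is a hypothetical balancing weight)
    return cross_entropy(logits, label) + lam * mse(recon, target)
```

With a perfect reconstruction, `joint_loss` reduces to the classification term alone, which is how the supervised signal dominates once the self-supervised task is solved.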
📝 Abstract
Micro-gestures are unconsciously performed body movements that can convey human emotional states, and they are attracting growing research attention as an emerging topic in human behavior understanding and affective computing. However, modeling human emotion from micro-gestures has not been sufficiently explored. In this work, we propose to recognize emotional states from micro-gestures by reconstructing behavior patterns with a hypergraph-enhanced Transformer in a hybrid-supervised framework. In this framework, a hypergraph-Transformer-based encoder and decoder are designed separately by stacking hypergraph-enhanced self-attention and multiscale temporal convolution modules. In particular, to better capture the subtle motion of micro-gestures, the decoder includes additional upsampling operations for a reconstruction task trained in a self-supervised manner. We further propose a hypergraph-enhanced self-attention module in which the hyperedges between skeleton joints are gradually updated to represent the relationships among body joints and thereby model subtle local motion. Finally, to exploit the relationship between emotional states and the local motion of micro-gestures, a shallow emotion recognition head is attached to the encoder output and trained in a supervised way. The end-to-end framework is jointly trained in a single stage, comprehensively utilizing both self-reconstruction and supervision signals. The proposed method is evaluated on two publicly available datasets, iMiGUE and SMG, and achieves the best performance under multiple metrics, surpassing existing methods.
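To make the hypergraph-enhanced self-attention idea concrete, the pure-Python sketch below biases standard attention scores with a hypergraph adjacency built from a joint-by-hyperedge incidence matrix, so joints sharing a hyperedge attend more strongly to one another. This is only an illustrative simplification under our own assumptions (Q = K = V = X, a fixed incidence matrix `H`, and a scalar bias weight); the paper's module learns and progressively updates the hyperedges.

```python
import math

def matmul(A, B):
    # plain list-of-lists matrix product
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def softmax_row(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def hypergraph_attention(X, H, bias_weight=1.0):
    """X: joints x dim features; H: joints x hyperedges incidence matrix.

    The attention logits X X^T / sqrt(d) are biased by the high-order
    adjacency A = H H^T, so joints in the same hyperedge get larger
    attention weights. bias_weight is a hypothetical scaling factor.
    """
    d = len(X[0])
    Ht = [list(col) for col in zip(*H)]
    A = matmul(H, Ht)                                    # high-order joint adjacency
    scores = matmul(X, [list(col) for col in zip(*X)])   # Q = K = X for brevity
    biased = [[s / math.sqrt(d) + bias_weight * a
               for s, a in zip(srow, arow)]
              for srow, arow in zip(scores, A)]
    weights = [softmax_row(r) for r in biased]
    return matmul(weights, X)                            # V = X for brevity
```

Progressively updating the hyperedges, as the paper proposes, would amount to making `H` a learnable matrix refreshed layer by layer rather than a fixed input.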