🤖 AI Summary
This work addresses the challenge of recognizing concealed emotions in videos under conditions of scarce annotations and severe class imbalance by proposing a multimodal weakly supervised framework. High-quality pseudo-labels are generated using the Gemini 2.5 Pro vision-language model enhanced with chain-of-thought and self-reflection prompting. The approach integrates facial visual features (DINOv2), pose keypoint sequences (OpenPose), and interview transcripts (BERT) through a staged pretraining and joint fine-tuning strategy. Notably, a multilayer perceptron (MLP) is innovatively employed to model spatiotemporal relationships among keypoints, replacing conventional graph neural networks and achieving a favorable trade-off between efficiency and accuracy. Evaluated on the iMiGUE dataset, the method improves accuracy from below 0.60 to over 0.69, establishing a new state-of-the-art and demonstrating that MLP-based modeling can match or even surpass graph convolutional networks in keypoint-based emotion recognition.
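The summary's central modeling claim is that a plain MLP over keypoint features can stand in for a graph neural network. A minimal sketch of that idea is below, assuming NumPy, 137 OpenPose keypoints with (x, y) coordinates per frame, and hypothetical hidden/output sizes (the real model's layer widths are not given in the summary):

```python
import numpy as np

def keypoint_mlp_features(frames, w1, b1, w2, b2):
    """Per-frame MLP over flattened keypoints, replacing a GCN.

    frames: (T, K, 2) array of T frames, each with K 2-D keypoints.
    Returns: (T, d_out) frame embeddings.
    """
    T, K, _ = frames.shape
    x = frames.reshape(T, K * 2)          # flatten all keypoints per frame
    h = np.maximum(x @ w1 + b1, 0.0)      # hidden layer with ReLU
    return h @ w2 + b2                    # linear projection to d_out

# Hypothetical sizes for illustration only.
rng = np.random.default_rng(0)
T, K, d_hid, d_out = 8, 137, 64, 32
frames = rng.standard_normal((T, K, 2))
w1 = rng.standard_normal((K * 2, d_hid)) * 0.01
b1 = np.zeros(d_hid)
w2 = rng.standard_normal((d_hid, d_out)) * 0.01
b2 = np.zeros(d_out)
emb = keypoint_mlp_features(frames, w1, b1, w2, b2)  # shape (8, 32)
```

Unlike a GCN, this treats the skeleton's connectivity implicitly: the first linear layer can learn arbitrary joint-to-joint interactions instead of propagating along a fixed adjacency, which is where the efficiency/accuracy trade-off the summary mentions comes from.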
π Abstract
To tackle the automatic recognition of "concealed emotions" in videos, this paper proposes a multimodal weakly supervised framework and achieves state-of-the-art results on the iMiGUE tennis-interview dataset. First, YOLO11x detects and crops human portraits frame by frame, and DINOv2-Base extracts visual features from the cropped regions. Next, by combining Chain-of-Thought and Reflection prompting (CoT + Reflection), Gemini 2.5 Pro automatically generates pseudo-labels and reasoning texts that serve as weak supervision for downstream models. Subsequently, OpenPose produces 137-dimensional key-point sequences, augmented with inter-frame offset features; the usual graph-neural-network backbone is simplified to an MLP that efficiently models the spatiotemporal relationships of the three key-point streams. An ultra-long-sequence Transformer independently encodes the image and key-point sequences, and their representations are concatenated with BERT-encoded interview transcripts. Each modality is first pre-trained in isolation and then fine-tuned jointly, with pseudo-labeled samples merged into the training set for further gains. Experiments demonstrate that, despite severe class imbalance, the proposed approach lifts accuracy from under 0.60 in prior work to over 0.69, establishing a new public benchmark. The study also validates that an "MLP-ified" key-point backbone can match, or even surpass, GCN-based counterparts on this task.
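Two preprocessing/fusion steps in the pipeline are simple enough to sketch: augmenting each key-point sequence with inter-frame offsets, and late fusion by concatenating the per-modality representations. The sketch below assumes NumPy; the keypoint-encoder output size (256) is a hypothetical value, while 768 matches the standard hidden size of DINOv2-Base and BERT-base:

```python
import numpy as np

def add_offsets(kp_seq):
    """Augment a (T, D) key-point sequence with inter-frame offsets.

    The offset stream is the frame-to-frame difference, padded with a
    zero row at t=0 so it keeps length T, then concatenated feature-wise.
    """
    offsets = np.diff(kp_seq, axis=0, prepend=kp_seq[:1])
    return np.concatenate([kp_seq, offsets], axis=1)  # (T, 2 * D)

def fuse_modalities(visual_emb, keypoint_emb, text_emb):
    """Late fusion by concatenation, as described in the abstract."""
    return np.concatenate([visual_emb, keypoint_emb, text_emb])

T, D = 16, 274                                # 137 keypoints x (x, y)
kp = np.random.default_rng(1).standard_normal((T, D))
kp_aug = add_offsets(kp)                      # (16, 548)
fused = fuse_modalities(np.ones(768),         # DINOv2-Base pooled feature
                        np.ones(256),         # hypothetical keypoint size
                        np.ones(768))         # BERT-base [CLS] feature
```

In the staged training regime the summary describes, each encoder producing these vectors would first be pre-trained alone, and only the concatenated representation is used during joint fine-tuning.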