AI Summary
This work addresses a limitation of conventional video pretraining methods for facial expression recognition: because they rely on pixel-level reconstruction, they are susceptible to interference from irrelevant background information. To overcome this, the study introduces, for the first time, a purely embedding-based predictive self-supervised learning approach tailored to this task. Specifically, it proposes a pretraining framework built on the Video Joint-Embedding Predictive Architecture (V-JEPA), which learns expression-relevant features by predicting the semantic embeddings of masked video regions from those of unmasked regions, thereby avoiding the redundancy inherent in pixel reconstruction. A shallow classifier on top of the pretrained encoder suffices for effective recognition. Experiments show state-of-the-art performance on RAVDESS and better results than all purely visual methods on CREMA-D (+1.48 weighted average recall, WAR), along with strong cross-dataset generalization.
Abstract
This paper introduces a novel application of Video Joint-Embedding Predictive Architectures (V-JEPAs) to Facial Expression Recognition (FER). Departing from conventional pre-training methods for video understanding that rely on pixel-level reconstruction, V-JEPAs learn by predicting the embeddings of masked regions from the embeddings of unmasked regions. This lets the trained encoder ignore information that is irrelevant to the task, such as the color of a region of background pixels. Using a pre-trained V-JEPA video encoder, we train shallow classifiers on the RAVDESS and CREMA-D datasets, achieving state-of-the-art performance on RAVDESS and outperforming all other vision-based methods on CREMA-D (+1.48 WAR). Furthermore, cross-dataset evaluations reveal strong generalization capabilities, demonstrating the potential of purely embedding-based pre-training approaches to advance FER. We release our code at https://github.com/lennarteingunia/vjepa-for-fer.
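The embedding-prediction objective described above can be illustrated with a toy sketch. This is not the released implementation: the linear "encoders", the mean-pooled context, the L1 distance, and every name below (`encode`, `embedding_prediction_loss`, `enc_w`, `tgt_w`, `pred_w`) are assumptions chosen for illustration. A real predictor would also be conditioned on the positions of the masked regions, which this sketch omits. The point it shows is only that the loss compares embeddings with embeddings and never reconstructs pixels.

```python
import random


def encode(vector, weights):
    """Toy linear encoder (assumption): multiply a vector by a weight matrix."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def embedding_prediction_loss(patches, mask, enc_w, tgt_w, pred_w):
    """Mean L1 distance between predicted and target embeddings of masked patches.

    patches: list of flattened video patches (lists of floats)
    mask:    list of bools, True where a patch is hidden from the context encoder
    """
    # Context encoder sees only the unmasked patches.
    context = [encode(p, enc_w) for p, hidden in zip(patches, mask) if not hidden]
    summary = mean_vector(context)

    # Predictor guesses the embedding of the masked patches from the context.
    # Simplification: one shared prediction, no positional conditioning.
    prediction = encode(summary, pred_w)

    # Targets are embeddings of the masked patches from a separate target
    # encoder; they are compared in embedding space, never decoded to pixels.
    targets = [encode(p, tgt_w) for p, hidden in zip(patches, mask) if hidden]

    total = sum(abs(a - b) for t in targets for a, b in zip(prediction, t))
    return total / (len(targets) * len(prediction))


if __name__ == "__main__":
    rng = random.Random(0)

    def rand_matrix(rows, cols):
        return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

    patches = [[rng.random() for _ in range(4)] for _ in range(6)]
    mask = [False, True, False, True, False, False]
    enc_w = rand_matrix(3, 4)
    tgt_w = [row[:] for row in enc_w]  # target encoder starts as a copy
    pred_w = rand_matrix(3, 3)
    print(embedding_prediction_loss(patches, mask, enc_w, tgt_w, pred_w))
```

In a pixel-reconstruction setup the targets would be the raw masked patches, so the model would be penalized for missing details such as background color. Here the targets are whatever the target encoder chooses to represent, which is the property the abstract relies on.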