Video Joint-Embedding Predictive Architectures for Facial Expression Recognition

📅 2026-01-14

🤖 AI Summary

This work addresses a limitation of conventional video pretraining methods for facial expression recognition: because they rely on pixel-level reconstruction, they are susceptible to interference from irrelevant background information. To overcome this, the study introduces, for the first time, a purely embedding-based predictive self-supervised learning approach for this task. Specifically, it proposes a pretraining framework built on the Video Joint-Embedding Predictive Architecture (V-JEPA), which learns expression-relevant features by predicting the semantic embeddings of masked video regions from those of unmasked regions, thereby avoiding the redundancy inherent in pixel reconstruction. A shallow classifier on top of the pretrained encoder suffices for highly effective recognition. Experiments show state-of-the-art performance on RAVDESS and better results than all purely visual methods on CREMA-D (+1.48% WAR), along with strong cross-dataset generalization.
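To make the idea of predicting embeddings rather than pixels concrete, here is a minimal toy sketch. It is not the paper's implementation: the function name, the linear stand-ins for the encoders and predictor, the mean-pooled context, and the L1 distance are all assumptions chosen for brevity. A real predictor would also condition on the positions of the masked tokens.

```python
import numpy as np


def jepa_style_loss(tokens, mask, w_context, w_target, w_predictor):
    """Toy embedding-prediction loss (hypothetical names, linear stand-ins).

    tokens: (num_tokens, dim) patch tokens of one video clip
    mask:   (num_tokens,) boolean, True where a token is hidden from the context encoder
    """
    # Encode only the visible tokens with the (assumed linear) context encoder.
    context = tokens[~mask] @ w_context             # (num_visible, emb)
    # Encode the masked tokens with the target encoder; these are the regression targets.
    targets = tokens[mask] @ w_target               # (num_masked, emb)
    # Predict every masked embedding from a pooled summary of the visible context.
    summary = context.mean(axis=0, keepdims=True)   # (1, emb)
    predicted = np.repeat(summary @ w_predictor, targets.shape[0], axis=0)
    # The loss compares embeddings with embeddings; pixels are never reconstructed.
    return float(np.abs(predicted - targets).mean())


rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))        # 16 tokens of dimension 8 (made-up sizes)
mask = np.zeros(16, dtype=bool)
mask[10:] = True                         # hide the last 6 tokens
loss = jepa_style_loss(
    tokens,
    mask,
    w_context=rng.normal(size=(8, 4)),
    w_target=rng.normal(size=(8, 4)),
    w_predictor=rng.normal(size=(4, 4)),
)
print(loss)
```

Because the target lives in embedding space, the encoder is not penalized for discarding details such as background color, which a pixel-reconstruction loss would force it to retain.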

📝 Abstract

This paper introduces a novel application of Video Joint-Embedding Predictive Architectures (V-JEPAs) for Facial Expression Recognition (FER). Departing from conventional pre-training methods for video understanding that rely on pixel-level reconstructions, V-JEPAs learn by predicting embeddings of masked regions from the embeddings of unmasked regions. This allows the trained encoder to ignore information that is irrelevant to a given video, such as the color of a region of pixels in the background. Using a pre-trained V-JEPA video encoder, we train shallow classifiers on the RAVDESS and CREMA-D datasets, achieving state-of-the-art performance on RAVDESS and outperforming all other vision-based methods on CREMA-D (+1.48 WAR). Furthermore, cross-dataset evaluations reveal strong generalization capabilities, demonstrating the potential of purely embedding-based pre-training approaches to advance FER. We release our code at https://github.com/lennarteingunia/vjepa-for-fer.
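The CREMA-D gain above is reported in WAR. In the FER literature, WAR usually denotes weighted average recall (per-class recall weighted by class frequency, which equals overall accuracy) and is often reported alongside UAR, the unweighted mean of per-class recalls. The following is a minimal sketch under that assumed definition; the function name and the toy labels are illustrative, and the paper's own evaluation code may differ.

```python
import numpy as np


def recall_metrics(y_true, y_pred):
    """Return (WAR, UAR) for integer class labels, assuming the usual FER definitions."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # Classes present in the ground truth and how many samples each one has.
    classes, support = np.unique(y_true, return_counts=True)
    # Recall of class c: fraction of samples with true label c that were predicted as c.
    recalls = np.array([(y_pred[y_true == c] == c).mean() for c in classes])
    war = float((recalls * support).sum() / support.sum())  # weighted by class frequency
    uar = float(recalls.mean())                             # every class counts equally
    return war, uar


# Toy, imbalanced example: three samples of class 0, one of class 1, all predicted as 0.
war, uar = recall_metrics([0, 0, 0, 1], [0, 0, 0, 0])
print(war, uar)  # 0.75 0.5
```

On imbalanced data the two numbers diverge, as the toy example shows, so a WAR improvement alone does not indicate how minority expressions are handled.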

Problem

Research questions and friction points this paper is trying to address.

Facial Expression Recognition

Video Understanding

Generalization

Representation Learning

Irrelevant Information

Innovation

Methods, ideas, or system contributions that make the work stand out.

Video Joint-Embedding Predictive Architecture

Facial Expression Recognition

embedding-based pre-training

masked prediction

generalization
