Video Joint-Embedding Predictive Architectures for Facial Expression Recognition

2026-01-14
AI Summary
This work addresses the limitations of conventional video pretraining methods for facial expression recognition, which rely on pixel-level reconstruction and are thus susceptible to interference from irrelevant background information. To overcome this, the study introduces, for the first time, a purely embedding-based predictive self-supervised learning approach tailored to this task. Specifically, it proposes a pretraining framework built upon the Video Joint-Embedding Predictive Architecture (V-JEPA), which learns expression-relevant features by predicting semantic embeddings of masked video regions from unmasked ones, thereby avoiding the redundancy inherent in pixel reconstruction. A shallow classifier on top of the pretrained encoder suffices for effective recognition. Experiments demonstrate state-of-the-art performance on RAVDESS and better results than all purely visual methods on CREMA-D (+1.48 WAR), along with strong cross-dataset generalization.
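The masked-embedding objective described above can be illustrated with a toy sketch. It is not the authors' implementation: the linear "encoders", the mean-pooling predictor, the L1 distance, and every function name are assumptions chosen for brevity. The only point it makes is that the loss compares embeddings with embeddings, so no pixels are ever reconstructed.

```python
def linear(weights, vec):
    """Apply a weight matrix (list of rows) to a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def embedding_prediction_loss(tokens, masked_idx, context_enc, target_enc, predictor):
    """Toy JEPA-style loss (hypothetical helper, not from the paper).

    tokens      : list of per-patch feature vectors for one clip
    masked_idx  : indices of the patches hidden from the context encoder
    context_enc : weight matrix standing in for the context encoder
    target_enc  : weight matrix standing in for the target encoder
                  (in JEPA-style training this is typically a slowly
                  updated copy that receives no gradient)
    predictor   : weight matrix standing in for the predictor network
    """
    masked = set(masked_idx)
    # The context encoder only ever sees the unmasked patches.
    context = [linear(context_enc, t) for i, t in enumerate(tokens) if i not in masked]
    if not context or not masked_idx:
        raise ValueError("need at least one masked and one unmasked token")

    dim = len(context[0])
    # Mean-pool the context embeddings. A real predictor would also use
    # positional information, so it could differ per masked location.
    pooled = [sum(c[d] for c in context) / len(context) for d in range(dim)]
    prediction = linear(predictor, pooled)

    total = 0.0
    for i in masked_idx:
        target = linear(target_enc, tokens[i])  # embedding of the hidden patch
        # L1 distance in embedding space (assumed choice of distance).
        total += sum(abs(p - t) for p, t in zip(prediction, target)) / dim
    return total / len(masked_idx)


# Tiny usage example with identity "networks" and two 2-D patches.
identity = [[1.0, 0.0], [0.0, 1.0]]
demo_loss = embedding_prediction_loss(
    [[0.0, 0.0], [2.0, 2.0]], [1], identity, identity, identity
)
print(demo_loss)
```

Because the target is an embedding rather than raw pixels, details that the target encoder does not represent, such as the exact color of a background patch, contribute nothing to the loss.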

Abstract
This paper introduces a novel application of Video Joint-Embedding Predictive Architectures (V-JEPAs) for Facial Expression Recognition (FER). Departing from conventional pre-training methods for video understanding that rely on pixel-level reconstruction, V-JEPAs learn by predicting the embeddings of masked regions from the embeddings of unmasked regions. This lets the trained encoder avoid capturing irrelevant information about a given video, such as the color of a region of pixels in the background. Using a pre-trained V-JEPA video encoder, we train shallow classifiers on the RAVDESS and CREMA-D datasets, achieving state-of-the-art performance on RAVDESS and outperforming all other vision-based methods on CREMA-D (+1.48 WAR). Furthermore, cross-dataset evaluations reveal strong generalization capabilities, demonstrating the potential of purely embedding-based pre-training approaches to advance FER. We release our code at https://github.com/lennarteingunia/vjepa-for-fer.
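The CREMA-D gain above is reported in WAR. Assuming WAR denotes weighted average recall, as is common in the FER literature, the sketch below shows how it could be computed and how it differs from the unweighted variant (UAR), which is included only for contrast. The function names and the toy labels are illustrative assumptions, not taken from the paper or its code.

```python
from collections import Counter


def recall_per_class(y_true, y_pred):
    """Return ({class: recall}, {class: support}) for the classes in y_true."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    recalls = {c: hits[c] / n for c, n in support.items()}
    return recalls, support


def weighted_average_recall(y_true, y_pred):
    """Per-class recall weighted by class frequency (equals overall accuracy)."""
    recalls, support = recall_per_class(y_true, y_pred)
    total = sum(support.values())
    return sum(recalls[c] * support[c] for c in recalls) / total


def unweighted_average_recall(y_true, y_pred):
    """Plain mean of per-class recalls; every class counts equally."""
    recalls, _ = recall_per_class(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


# Toy, made-up labels: a classifier that always answers "happy".
y_true = ["happy", "happy", "happy", "sad"]
y_pred = ["happy", "happy", "happy", "happy"]
print(weighted_average_recall(y_true, y_pred))    # 0.75
print(unweighted_average_recall(y_true, y_pred))  # 0.5
```

On imbalanced data the two metrics can diverge, as in the toy labels here, which is why FER papers usually state which one they report.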
Problem

Research questions and friction points this paper is trying to address.

Facial Expression Recognition
Video Understanding
Generalization
Representation Learning
Irrelevant Information
Innovation

Methods, ideas, or system contributions that make the work stand out.

Video Joint-Embedding Predictive Architecture
Facial Expression Recognition
Embedding-Based Pre-Training
Masked Prediction
Generalization
Lennart Eing
Chair for Human-Centered Artificial Intelligence, University of Augsburg, Augsburg, Germany
Cristina Luna-Jimenez
Chair for Human-Centered Artificial Intelligence, University of Augsburg, Augsburg, Germany
Silvan Mertes
University of Applied Sciences Augsburg
HCI, Generative AI, Explainable AI, VR/AR
Elisabeth Andre
Professor of Computer Sciences, Augsburg University
Intelligent User Interfaces, Affective Computing, Social Robotics, Virtual Humans, Social Signal Processing