🤖 AI Summary
Existing facial expression coding systems (e.g., FACS) suffer from limited coverage and reliance on costly manual annotation, hindering scalability and robustness in fine-grained expression analysis. To address this, we propose Discrete Facial Encoding (DFE): a novel unsupervised learning framework that operates on 3D mesh sequences and leverages identity-invariant 3D Morphable Model (3DMM) features. DFE introduces the first Residual Vector Quantized Variational Autoencoder (RVQ-VAE) for facial dynamics, automatically discovering a compact, interpretable, and pose-decoupled discrete codebook of facial deformations. This codebook enables reusable, semantically meaningful expression pattern labels, substantially expanding coverage beyond the FACS behavioral repertoire. Integrating a Bag-of-Words representation over DFE codes, our method achieves significant performance gains over FACS and MAE baselines in stress detection, personality prediction, and depression detection—demonstrating superior effectiveness, interpretability, and generalizability for psychological and affective computing.
📝 Abstract
Facial expression analysis is central to understanding human behavior, yet existing coding systems such as the Facial Action Coding System (FACS) are constrained by limited coverage and costly manual annotation. In this work, we introduce Discrete Facial Encoding (DFE), an unsupervised, data-driven alternative that learns a compact and interpretable dictionary of facial expressions from 3D mesh sequences through a Residual Vector Quantized Variational Autoencoder (RVQ-VAE). Our approach first extracts identity-invariant expression features from images using a 3D Morphable Model (3DMM), effectively disentangling factors such as head pose and facial geometry. We then encode these features using an RVQ-VAE, producing a sequence of discrete tokens from a shared codebook, where each token captures a specific, reusable facial deformation pattern that contributes to the overall expression. Through extensive experiments, we demonstrate that Discrete Facial Encoding captures more precise facial behaviors than FACS and other facial encoding alternatives. We evaluate the utility of our representation across three high-level psychological tasks: stress detection, personality prediction, and depression detection. Using a simple Bag-of-Words model built on top of the learned tokens, our system consistently outperforms both FACS-based pipelines and strong image and video representation learning models such as Masked Autoencoders. Further analysis reveals that our representation covers a wider variety of facial displays, highlighting its potential as a scalable and effective alternative to FACS for psychological and affective computing applications.
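To make the pipeline concrete, the sketch below illustrates the two ideas the abstract describes: residual vector quantization, which approximates each per-frame expression feature as a sum of codebook entries by quantizing the leftover residual at each stage, and a Bag-of-Words pooling of the resulting tokens into a per-video descriptor. All dimensions, codebook contents, and the single-stage histogram are illustrative assumptions, not the paper's actual model or configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_STAGES = 3      # residual quantization stages (assumed, not from the paper)
CODEBOOK_SIZE = 64  # entries per codebook (assumed)
FEATURE_DIM = 50    # 3DMM expression coefficient dimension (assumed)

# One random codebook per stage; a trained RVQ-VAE would learn these.
codebooks = [rng.standard_normal((CODEBOOK_SIZE, FEATURE_DIM))
             for _ in range(NUM_STAGES)]

def rvq_encode(x):
    """Greedy residual VQ: at each stage, pick the codeword nearest to the
    current residual, record its index, and subtract it. Returns one token
    per stage."""
    tokens, residual = [], x.copy()
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        tokens.append(idx)
        residual = residual - cb[idx]
    return tokens

def bag_of_words(token_seqs):
    """Normalized histogram of first-stage tokens across a video's frames:
    a minimal Bag-of-Words descriptor of the kind built on top of the
    learned tokens."""
    hist = np.zeros(CODEBOOK_SIZE)
    for toks in token_seqs:
        hist[toks[0]] += 1
    return hist / max(hist.sum(), 1)

# Encode a short synthetic "video" of per-frame expression features.
frames = [rng.standard_normal(FEATURE_DIM) for _ in range(10)]
tokens_per_frame = [rvq_encode(f) for f in frames]
video_descriptor = bag_of_words(tokens_per_frame)
print(video_descriptor.shape)  # (64,)
```

Because later stages quantize only what earlier stages missed, a small codebook reused across stages yields a fine-grained discrete code; the fixed-length histogram can then feed any standard classifier for tasks like stress or depression detection.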