🤖 AI Summary
Automated Facial Action Coding System (FACS) analysis faces two key bottlenecks in mental health research: limited accuracy in Action Unit (AU) detection and incomplete coverage of facial movements, which together prevent comprehensive representation of facial expressions. This paper introduces Facial Basis, the first unsupervised, data-driven 3D facial motion coding framework, which replaces hand-crafted AUs with interpretable, localized motion primitives that enable complete, additive modeling. Its core contributions are: (1) constructing the first unsupervised facial motion dictionary; (2) overcoming three structural limitations of automated FACS coding: reliance on manual annotation, incomplete coverage, and non-additive primitives; and (3) establishing an end-to-end evaluation framework for comparison against AU-based pipelines. On real-world conversational videos, Facial Basis predicts autism spectrum disorder diagnosis significantly more accurately than state-of-the-art AU detectors. The code and models are publicly released.
📝 Abstract
The Facial Action Coding System (FACS) has been used by numerous studies to investigate the links between facial behavior and mental health. The laborious and costly process of FACS coding has motivated the development of machine learning frameworks for Action Unit (AU) detection. Despite intense efforts spanning three decades, the detection accuracy for many AUs is considered to be below the threshold needed for behavioral research. Moreover, many AUs are excluded altogether, making it impossible to fulfill the ultimate goal of FACS: the representation of any facial expression in its entirety. This paper considers an alternative approach. Instead of creating automated tools that mimic FACS experts, we propose a new coding system that mimics the key properties of FACS. Specifically, we construct a data-driven coding system called the Facial Basis, which contains units that correspond to localized and interpretable 3D facial movements, and which overcomes three structural limitations of automated FACS coding. First, the proposed method is completely unsupervised, bypassing costly, laborious and variable manual annotation. Second, Facial Basis reconstructs all observable movement, rather than relying on a limited repertoire of recognizable movements (as in automated FACS). Finally, the Facial Basis units are additive, whereas AUs may fail detection when they appear in a non-additive combination. The proposed method outperforms the most frequently used AU detector in predicting autism diagnosis from in-person and remote conversations, highlighting the importance of encoding facial behavior comprehensively. To our knowledge, Facial Basis is the first alternative to FACS for deconstructing facial expressions in videos into localized movements. We provide an open-source implementation of the method at github.com/sariyanidi/FacialBasis.
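The additivity property described above can be illustrated with a toy sketch: if each basis unit is a localized per-vertex 3D displacement field, then coding an observed deformation reduces to solving a least-squares problem for the units' activation strengths. The mesh size, number of units, disjoint-patch construction, and non-negative solver below are all illustrative assumptions, not the authors' actual pipeline.

```python
# Illustrative sketch (not the paper's implementation): encoding an observed
# facial deformation as an additive combination of localized 3D motion units.
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)

n_vertices = 50   # toy 3D face mesh size (assumed)
n_units = 4       # number of basis units (assumed)

# Each column is one localized motion unit: per-vertex (x, y, z) displacements
# that are non-zero only on a small patch of the mesh ("localized").
B = np.zeros((3 * n_vertices, n_units))
for k in range(n_units):
    patch = slice(3 * 10 * k, 3 * 10 * (k + 1))  # disjoint 10-vertex patches
    B[patch, k] = rng.normal(size=30)

# Observed deformation = additive combination of units, plus small noise.
true_coeffs = np.array([0.8, 0.0, 1.5, 0.3])
d = B @ true_coeffs + 0.01 * rng.normal(size=3 * n_vertices)

# Because the units combine additively, recovering the code is a simple
# (here non-negative) least-squares fit of activation strengths.
coeffs, residual = nnls(B, d)
print(np.round(coeffs, 2))
```

The recovered coefficients closely match the ground-truth activations, which is exactly the property that non-additive AU combinations break: when primitives interact nonlinearly, no single coefficient vector reconstructs the observed movement.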