🤖 AI Summary
Existing approaches struggle to achieve effective bidirectional knowledge transfer between facial action unit (AU) detection and facial expression (FE) recognition because heterogeneous datasets differ in annotation paradigm, label granularity, and data diversity. This work proposes a Structured Semantic Mapping (SSM) framework that learns AUs and FEs jointly and bidirectionally within a unified semantic space, using a shared visual backbone, a Textual Semantic Prototype (TSP) module, and a Dynamic Prior Mapping (DPM) module. SSM is the first method to accomplish bidirectional knowledge transfer under heterogeneous supervision, using Facial Action Coding System priors to construct a dynamic association matrix and textual semantics as cross-task alignment anchors. Experiments show that SSM achieves state-of-the-art performance on major AU and FE benchmarks and that holistic expression semantics can in turn enhance fine-grained AU learning.
📝 Abstract
Facial action unit (AU) detection and facial expression (FE) recognition can be jointly viewed as affective facial behavior tasks, representing fine-grained muscular activations and coarse-grained holistic affective states, respectively. Despite their inherent semantic correlation, existing studies predominantly focus on knowledge transfer from AUs to FEs, while bidirectional learning remains insufficiently explored. In practice, this challenge is further compounded by heterogeneous data conditions, where AU and FE datasets differ in annotation paradigms (frame-level vs. clip-level), label granularity, and data availability and diversity, hindering effective joint learning. To address these issues, we propose a Structured Semantic Mapping (SSM) framework for bidirectional AU-FE learning under different data domains and heterogeneous supervision. SSM consists of three key components: (1) a shared visual backbone that learns unified facial representations from dynamic AU and FE videos; (2) semantic mediation via a Textual Semantic Prototype (TSP) module, which constructs structured semantic prototypes from fixed textual descriptions augmented with learnable context prompts, serving as supervision signals and cross-task alignment anchors in a shared semantic space; and (3) a Dynamic Prior Mapping (DPM) module that incorporates prior knowledge derived from the Facial Action Coding System and learns a data-driven association matrix in a high-level feature space, enabling explicit and bidirectional knowledge transfer. Extensive experiments on popular AU detection and FE recognition benchmarks show that SSM achieves state-of-the-art performance on both tasks simultaneously, and demonstrate that holistic expression semantics can in turn enhance fine-grained AU learning even across heterogeneous datasets.
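To make the idea of a prior-initialized, bidirectional association matrix more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a binary prior built from a few commonly cited prototypical AU combinations, an additive residual standing in for the learned, data-driven part, and plain score vectors in place of the high-level features the abstract mentions. All names (`build_prior`, `association`, `au_to_fe`, `fe_to_au`) and the choice of AUs and expressions are hypothetical.

```python
# Hypothetical sketch: the AU/expression subsets, the additive-residual form,
# and the use of raw scores instead of high-level features are assumptions.
AUS = [1, 2, 4, 5, 6, 12, 15, 26]
EXPRESSIONS = ["happiness", "sadness", "surprise"]

# Commonly cited prototypical AU combinations, used here for illustration only;
# the exact prior used by SSM is not given in the abstract.
FACS_PROTOTYPES = {
    "happiness": [6, 12],
    "sadness": [1, 4, 15],
    "surprise": [1, 2, 5, 26],
}


def build_prior(aus, expressions, prototypes):
    """Binary prior matrix: one row per expression, one column per AU."""
    return [[1.0 if au in prototypes[fe] else 0.0 for au in aus]
            for fe in expressions]


def association(prior, residual):
    """Association matrix = fixed prior + residual (the residual would be learned)."""
    return [[p + r for p, r in zip(p_row, r_row)]
            for p_row, r_row in zip(prior, residual)]


def au_to_fe(assoc, au_scores):
    """AU -> FE direction: multiply the matrix by the AU score vector."""
    return [sum(w * s for w, s in zip(row, au_scores)) for row in assoc]


def fe_to_au(assoc, fe_scores):
    """FE -> AU direction: multiply the transposed matrix by the FE score vector."""
    n_aus = len(assoc[0])
    return [sum(row[j] * fe_scores[i] for i, row in enumerate(assoc))
            for j in range(n_aus)]


prior = build_prior(AUS, EXPRESSIONS, FACS_PROTOTYPES)
residual = [[0.0] * len(AUS) for _ in EXPRESSIONS]  # zero here; trained in practice
assoc = association(prior, residual)

# AU6 and AU12 strongly active, everything else off.
au_scores = [0.0, 0.0, 0.0, 0.0, 0.9, 0.8, 0.0, 0.0]
fe_scores = au_to_fe(assoc, au_scores)
predicted_fe = EXPRESSIONS[fe_scores.index(max(fe_scores))]

# A one-hot "happiness" label mapped back to the AUs it supports.
au_hint = fe_to_au(assoc, [1.0, 0.0, 0.0])
```

Using the same matrix in one direction and its transpose in the other is one simple way to get explicit two-way transfer; with a zero residual the mapping reduces to the prior alone, and training the residual would let data adjust the associations.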