🤖 AI Summary
This work addresses the challenge of learning discriminative and generalizable facial Action Unit (AU) representations under limited labeled data. The authors propose HiVA, a novel approach that leverages large language models to generate rich AU semantic descriptions as priors and introduces a hierarchical cross-modal attention architecture to integrate fine-grained and global vision-language associations. HiVA incorporates an AU-aware dynamic graph module for learning AU-specific visual representations, together with a Disentangled Dual Cross-Attention (DDCA) mechanism that establishes fine-grained, AU-specific interactions between visual and textual features and a Contextual Dual Cross-Attention (CDCA) mechanism that models global inter-AU dependencies, thereby enhancing semantic interpretability and robustness. Extensive experiments demonstrate that HiVA outperforms existing methods across multiple benchmarks and produces semantically coherent activation patterns, validating its effectiveness in cross-modal alignment and facial behavior analysis.
📝 Abstract
Facial Action Unit (AU) detection seeks to recognize subtle facial muscle activations as defined by the Facial Action Coding System (FACS). A primary challenge in AU detection is the effective learning of discriminative and generalizable AU representations under conditions of limited annotated data. To address this, we propose a Hierarchical Vision-language Interaction for AU Understanding (HiVA) method, which leverages textual AU descriptions as semantic priors to guide and enhance AU detection. Specifically, HiVA employs a large language model to generate diverse and contextually rich AU descriptions to strengthen language-based representation learning. To capture both fine-grained and holistic vision-language associations, HiVA introduces an AU-aware dynamic graph module that facilitates the learning of AU-specific visual representations. These features are further integrated within a hierarchical cross-modal attention architecture comprising two complementary mechanisms: Disentangled Dual Cross-Attention (DDCA), which establishes fine-grained, AU-specific interactions between visual and textual features, and Contextual Dual Cross-Attention (CDCA), which models global inter-AU dependencies. This collaborative, cross-modal learning paradigm enables HiVA to leverage multi-grained vision-based AU features in conjunction with refined language-based AU details, culminating in robust and semantically enriched AU detection. Extensive experiments show that HiVA consistently surpasses state-of-the-art approaches. Moreover, qualitative analyses reveal that HiVA produces semantically meaningful activation patterns, highlighting its efficacy in learning robust and interpretable cross-modal correspondences for comprehensive facial behavior analysis.
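To make the described hierarchy concrete, below is a minimal PyTorch-style sketch of how per-AU visual tokens and LLM-derived textual AU embeddings could pass through a fine-grained dual cross-attention stage (DDCA) followed by a global inter-AU attention stage (CDCA). This is an illustrative reconstruction from the abstract only: the class names, feature dimensions, head counts, and the additive fusion are assumptions, not the authors' implementation.

```python
# Illustrative sketch only; internals (dimensions, head counts, fusion by addition)
# are assumptions, not the published HiVA implementation.
import torch
import torch.nn as nn


class DisentangledDualCrossAttention(nn.Module):
    """Fine-grained, AU-specific vision-language interaction (DDCA, as described)."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.v2t = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.t2v = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, vis_au: torch.Tensor, txt_au: torch.Tensor) -> torch.Tensor:
        # vis_au, txt_au: (B, num_AUs, dim) -- one token per AU in each modality.
        v_enh, _ = self.v2t(vis_au, txt_au, txt_au)   # visual tokens attend to AU descriptions
        t_enh, _ = self.t2v(txt_au, vis_au, vis_au)   # textual tokens attend to visual features
        return v_enh + t_enh                          # illustrative fusion by addition


class ContextualDualCrossAttention(nn.Module):
    """Global inter-AU dependency modelling (CDCA, as described)."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, au_tokens: torch.Tensor) -> torch.Tensor:
        # Attention across all AU tokens captures co-occurrence dependencies.
        out, _ = self.global_attn(au_tokens, au_tokens, au_tokens)
        return out


class HiVAHead(nn.Module):
    """Hierarchical composition: DDCA then CDCA, ending in per-AU activation logits."""

    def __init__(self, dim: int = 256, num_aus: int = 12):
        super().__init__()
        self.ddca = DisentangledDualCrossAttention(dim)
        self.cdca = ContextualDualCrossAttention(dim)
        self.classifier = nn.Linear(dim, 1)

    def forward(self, vis_au: torch.Tensor, txt_au: torch.Tensor) -> torch.Tensor:
        fused = self.ddca(vis_au, txt_au)            # fine-grained AU-level alignment
        fused = self.cdca(fused)                     # global inter-AU context
        return self.classifier(fused).squeeze(-1)    # (B, num_AUs) logits


if __name__ == "__main__":
    B, num_aus, dim = 2, 12, 256
    vis = torch.randn(B, num_aus, dim)   # AU-specific visual features (e.g. from a dynamic graph module)
    txt = torch.randn(B, num_aus, dim)   # LLM-derived AU description embeddings
    logits = HiVAHead(dim, num_aus)(vis, txt)
    print(logits.shape)                  # torch.Size([2, 12])
```

The two-stage ordering mirrors the hierarchy the abstract describes: AU-specific cross-modal alignment first, then global dependency modelling over the aligned AU tokens.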