Hierarchical Vision-Language Interaction for Facial Action Unit Detection

📅 2026-02-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of learning discriminative and generalizable facial Action Unit (AU) representations under limited labeled data. The authors propose HiVA, a novel approach that leverages large language models to generate rich AU semantic descriptions as priors and introduces a hierarchical cross-modal attention architecture to integrate fine-grained and global vision-language associations. HiVA incorporates a Disentangled Dual Cross-Attention (DDCA) mechanism and a Contextual Dual Cross-Attention (CDCA) mechanism to explicitly model AU-specific vision-language interactions and global inter-AU dependencies, respectively, thereby enhancing semantic interpretability and robustness. Extensive experiments demonstrate that HiVA outperforms existing methods across multiple benchmarks and produces semantically coherent activation patterns, validating its effectiveness in cross-modal alignment and facial behavior analysis.

📝 Abstract
Facial Action Unit (AU) detection seeks to recognize subtle facial muscle activations as defined by the Facial Action Coding System (FACS). A primary challenge in AU detection is the effective learning of discriminative and generalizable AU representations under conditions of limited annotated data. To address this, we propose a Hierarchical Vision-language Interaction for AU Understanding (HiVA) method, which leverages textual AU descriptions as semantic priors to guide and enhance AU detection. Specifically, HiVA employs a large language model to generate diverse and contextually rich AU descriptions to strengthen language-based representation learning. To capture both fine-grained and holistic vision-language associations, HiVA introduces an AU-aware dynamic graph module that facilitates the learning of AU-specific visual representations. These features are further integrated within a hierarchical cross-modal attention architecture comprising two complementary mechanisms: Disentangled Dual Cross-Attention (DDCA), which establishes fine-grained, AU-specific interactions between visual and textual features, and Contextual Dual Cross-Attention (CDCA), which models global inter-AU dependencies. This collaborative, cross-modal learning paradigm enables HiVA to leverage multi-grained vision-based AU features in conjunction with refined language-based AU details, culminating in robust and semantically enriched AU detection. Extensive experiments show that HiVA consistently surpasses state-of-the-art approaches. Moreover, qualitative analyses reveal that HiVA produces semantically meaningful activation patterns, highlighting its efficacy in learning robust and interpretable cross-modal correspondences for comprehensive facial behavior analysis.
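The paper's exact DDCA/CDCA formulations are not reproduced on this page, but the general idea of dual cross-attention between per-AU visual features and AU text embeddings can be illustrated with a minimal NumPy sketch. All names, dimensions, and the additive fusion step below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, k, v):
    """Scaled dot-product cross-attention: queries from one modality,
    keys/values from the other."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)       # (n_q, n_k) cross-modal affinities
    return softmax(scores, axis=-1) @ v

rng = np.random.default_rng(0)
n_au, d = 12, 64                        # e.g. 12 AUs, 64-dim features (assumed)
vis = rng.standard_normal((n_au, d))    # per-AU visual features (stand-in)
txt = rng.standard_normal((n_au, d))    # per-AU text-description embeddings (stand-in)

# Fine-grained path (DDCA-like): each visual AU query attends to text embeddings.
fine = cross_attention(vis, txt, txt)

# Global path (CDCA-like): text queries attend over visual AU features,
# pooling context across all AUs.
glob = cross_attention(txt, vis, vis)

fused = fine + glob                     # simple additive fusion (assumption)
print(fused.shape)                      # (12, 64)
```

In the actual method, the visual features would come from the AU-aware dynamic graph module and the text embeddings from LLM-generated AU descriptions; this sketch only shows the direction of the two attention paths.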
Problem

Research questions and friction points this paper is trying to address.

Facial Action Unit Detection
Limited Annotated Data
Discriminative Representation
Generalizable Representation
Vision-Language Interaction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language Interaction
Facial Action Unit Detection
Hierarchical Cross-Modal Attention
Large Language Model
Dynamic Graph Module
Yong Li
Associate Professor, Southeast University
machine learning, multimodal understanding, affective computing
Yi Ren
School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
Yizhe Zhang
Nanjing University of Science and Technology
Medical Image Analysis, Machine Learning, Algorithm Design
Wenhua Zhang
School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
Tianyi Zhang
Key Laboratory of Child Development and Learning Science (Ministry of Education), School of Biological Sciences and Medical Engineering, Southeast University, Nanjing, China
Muyun Jiang
Nanyang Technological University
Guo-Sen Xie
Professor, Nanjing University of Science and Technology
Computer Vision, Machine Learning
Cuntai Guan
President's Chair Professor, CCDS, Nanyang Technological University
Brain-Computer Interfaces, Machine Learning, Artificial Intelligence