SemanticFace: Semantic Facial Action Estimation via Semantic Distillation in Interpretable Space

📅 2026-03-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing methods for estimating facial action parameters from a single image often lack semantic interpretability, limiting their use in scenarios such as virtual character control that require muscle-level motion semantics. This work proposes SemanticFace, a framework built around a two-stage semantic distillation process in the ARKit blendshape space. First, structured semantic supervision signals are derived from ground-truth blendshape coefficients; these signals are then distilled into a multimodal large language model, enabling end-to-end prediction of interpretable facial action parameters directly from images. The study presents the first integration of structured semantic reasoning with multimodal large language models for facial action estimation, achieving high prediction accuracy and perceptual consistency while substantially improving generalization and robustness across identities and domains, including stylized faces such as cartoons.
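
As a rough illustration of what the first stage might look like, the sketch below converts ground-truth ARKit blendshape coefficients into structured, language-aligned descriptions grouped by facial region. The region grouping, intensity thresholds, and phrasing are illustrative assumptions, not the paper's actual supervision scheme; only the standard ARKit blendshape names (e.g. jawOpen, mouthSmileLeft) come from the ARKit specification.

```python
# Hypothetical sketch of Stage 1: turning ground-truth ARKit blendshape
# coefficients into structured semantic supervision. The thresholds, bucket
# labels, and region groupings are illustrative assumptions.

# A few of the 52 standard ARKit blendshape names, grouped by facial region.
REGION_GROUPS = {
    "brows": ["browInnerUp", "browDownLeft", "browDownRight"],
    "eyes": ["eyeBlinkLeft", "eyeBlinkRight", "eyeWideLeft", "eyeWideRight"],
    "mouth": ["jawOpen", "mouthSmileLeft", "mouthSmileRight", "mouthPucker"],
}

def intensity_label(value: float) -> str | None:
    """Map a coefficient in [0, 1] to an illustrative intensity bucket."""
    if value < 0.1:
        return None            # treat near-zero activations as inactive
    if value < 0.4:
        return "slightly"
    if value < 0.7:
        return "moderately"
    return "strongly"

def coefficients_to_semantics(coeffs: dict[str, float]) -> str:
    """Render active blendshapes as a structured textual description."""
    lines = []
    for region, names in REGION_GROUPS.items():
        active = []
        for name in names:
            label = intensity_label(coeffs.get(name, 0.0))
            if label is not None:
                active.append(f"{name} is {label} activated")
        if active:
            lines.append(f"{region}: " + "; ".join(active))
    return "\n".join(lines) if lines else "neutral expression"

# Example: a subtle open-mouth smile.
print(coefficients_to_semantics(
    {"jawOpen": 0.35, "mouthSmileLeft": 0.8, "mouthSmileRight": 0.75}
))
```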

📝 Abstract
Facial action estimation from a single image is often formulated as predicting or fitting parameters in compact expression spaces, which lack explicit semantic interpretability. However, many practical applications, such as avatar control and human-computer interaction, require interpretable facial actions that correspond to meaningful muscle movements. In this work, we propose SemanticFace, a framework for facial action estimation in the interpretable ARKit blendshape space that reformulates coefficient prediction as structured semantic reasoning. SemanticFace adopts a two-stage semantic distillation paradigm: it first derives structured semantic supervision from ground-truth ARKit coefficients and then distills this knowledge into a multimodal large language model to predict interpretable facial action coefficients from images. Extensive experiments demonstrate that language-aligned semantic supervision improves both coefficient accuracy and perceptual consistency, while enabling strong cross-identity generalization and robustness to large domain shifts, including cartoon faces.
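
For intuition about the second stage, a minimal sketch of one possible inference-side interface is shown below: the distilled multimodal LLM is prompted to emit facial action parameters as structured text, which is parsed back into numeric ARKit coefficients. The "name=value" output format, the parsing rules, and the clamping are assumptions for illustration; the paper does not specify this exact interface.

```python
# Hypothetical sketch of Stage 2 inference: parse the multimodal LLM's
# structured text output back into numeric ARKit blendshape coefficients.
# The output format and regex are assumptions, not the paper's interface.
import re

COEFF_PATTERN = re.compile(r"([A-Za-z]+)\s*=\s*([01](?:\.\d+)?)")

def parse_coefficients(model_output: str) -> dict[str, float]:
    """Extract 'name=value' pairs and clamp values to the valid [0, 1] range."""
    coeffs = {}
    for name, value in COEFF_PATTERN.findall(model_output):
        coeffs[name] = min(max(float(value), 0.0), 1.0)
    return coeffs

# Stand-in for the fine-tuned MLLM's response to an input image.
fake_response = "jawOpen=0.32, mouthSmileLeft=0.81, mouthSmileRight=0.78"
print(parse_coefficients(fake_response))
# {'jawOpen': 0.32, 'mouthSmileLeft': 0.81, 'mouthSmileRight': 0.78}
```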
Problem

Research questions and friction points this paper is trying to address.

facial action estimation
semantic interpretability
ARKit blendshape
interpretable space
single-image
Innovation

Methods, ideas, or system contributions that make the work stand out.

semantic distillation
interpretable facial action
ARKit blendshape
multimodal large language model
structured semantic reasoning
Zejian Kang (Zhejiang University)
Kai Zheng (Westlake University)
Yuanchen Fei (Hunan University)
Wentao Yang (Zhejiang University)
Hongyuan Zou (The University of Hong Kong)
Xiangru Huang (Westlake University)
Machine Learning and Optimization · Geometry Processing · Deep Learning