🤖 AI Summary
Existing methods for estimating facial action parameters from a single image often lack semantic interpretability, limiting their applicability in scenarios such as virtual character control that require muscle-level motion semantics. To address this, the authors propose SemanticFace, a framework that introduces a two-stage semantic distillation process within the ARKit blendshape space: structured semantic supervision signals are first derived from ground-truth blendshape coefficients, and this knowledge is then distilled into a multimodal large language model for end-to-end prediction of interpretable facial action parameters directly from images. This study presents the first integration of structured semantic reasoning with multimodal large language models for facial action estimation, achieving high prediction accuracy and perceptual consistency while substantially improving generalization and robustness across identities and domains, including stylized faces such as cartoons.
📝 Abstract
Facial action estimation from a single image is often formulated as predicting or fitting parameters in compact expression spaces that lack explicit semantic interpretability. However, many practical applications, such as avatar control and human-computer interaction, require interpretable facial actions that correspond to meaningful muscle movements. In this work, we propose **SemanticFace**, a framework for facial action estimation in the interpretable ARKit blendshape space that reformulates coefficient prediction as structured semantic reasoning. SemanticFace adopts a two-stage semantic distillation paradigm: it first derives structured semantic supervision from ground-truth ARKit coefficients and then distills this knowledge into a multimodal large language model to predict interpretable facial action coefficients from images. Extensive experiments demonstrate that language-aligned semantic supervision improves both coefficient accuracy and perceptual consistency, while enabling strong cross-identity generalization and robustness to large domain shifts, including cartoon faces.
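The first stage described above, deriving language-aligned supervision from ground-truth ARKit coefficients, can be sketched roughly as follows. The blendshape names are genuine ARKit keys, but the thresholding, intensity bucketing, and phrasing are illustrative assumptions, not the paper's actual supervision scheme.

```python
# Hypothetical sketch: converting ARKit blendshape coefficients into a
# structured semantic description that could supervise a multimodal LLM.
# The mapping table and bucketing rules are assumptions for illustration.

ARKIT_SEMANTICS = {
    "jawOpen": "mouth opening",
    "mouthSmileLeft": "left-side smile",
    "mouthSmileRight": "right-side smile",
    "browInnerUp": "inner brow raise",
    "eyeBlinkLeft": "left eye closure",
}

def intensity_label(value: float) -> str:
    """Map a coefficient in [0, 1] to a coarse intensity word."""
    if value >= 0.7:
        return "strong"
    if value >= 0.3:
        return "moderate"
    return "slight"

def coefficients_to_description(coeffs: dict, threshold: float = 0.2) -> str:
    """Build a structured text description of active facial actions,
    ordered from strongest to weakest coefficient."""
    active = [(name, v) for name, v in coeffs.items() if v >= threshold]
    active.sort(key=lambda item: item[1], reverse=True)
    phrases = [
        f"{intensity_label(v)} {ARKIT_SEMANTICS.get(name, name)} ({v:.2f})"
        for name, v in active
    ]
    return "; ".join(phrases) if phrases else "neutral face"

desc = coefficients_to_description(
    {"jawOpen": 0.85, "mouthSmileLeft": 0.4, "eyeBlinkLeft": 0.05}
)
print(desc)  # -> "strong mouth opening (0.85); moderate left-side smile (0.40)"
```

In the second stage, text of this form (paired with the source image) would serve as the distillation target, so the model learns to ground each coefficient in a muscle-level description rather than regress an opaque vector.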