🤖 AI Summary
Existing methods for estimating facial action parameters from a single image often lack semantic interpretability, limiting their applicability in scenarios such as virtual character control that require muscle-level motion semantics. To address this, the authors propose SemanticFace, a framework that introduces a two-stage semantic distillation process within the ARKit blendshape space: structured semantic supervision signals are first derived from ground-truth blendshape coefficients, and this knowledge is then distilled into a multimodal large language model for end-to-end prediction of interpretable facial action parameters directly from images. This study presents the first integration of structured semantic reasoning with multimodal large language models for facial action estimation, achieving high prediction accuracy and perceptual consistency while substantially improving generalization and robustness across identities and domains, including stylized faces such as cartoons.
📝 Abstract
Facial action estimation from a single image is often formulated as predicting or fitting parameters in compact expression spaces that lack explicit semantic interpretability. However, many practical applications, such as avatar control and human-computer interaction, require interpretable facial actions that correspond to meaningful muscle movements. In this work, we propose **SemanticFace**, a framework for facial action estimation in the interpretable ARKit blendshape space that reformulates coefficient prediction as structured semantic reasoning. SemanticFace adopts a two-stage semantic distillation paradigm: it first derives structured semantic supervision from ground-truth ARKit coefficients and then distills this knowledge into a multimodal large language model to predict interpretable facial action coefficients from images. Extensive experiments demonstrate that language-aligned semantic supervision improves both coefficient accuracy and perceptual consistency, while enabling strong cross-identity generalization and robustness to large domain shifts, including cartoon faces.
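The first stage described above, deriving language-aligned supervision from ground-truth ARKit coefficients, can be sketched roughly as follows. The blendshape names are genuine ARKit keys, but the thresholding, intensity bucketing, and phrasing are illustrative assumptions, not the paper's actual supervision scheme.

```python
# Hypothetical sketch: converting ARKit blendshape coefficients into a
# structured semantic description that could supervise a multimodal LLM.
# The mapping table and bucketing rules are assumptions for illustration.

ARKIT_SEMANTICS = {
    "jawOpen": "mouth opening",
    "mouthSmileLeft": "left-side smile",
    "mouthSmileRight": "right-side smile",
    "browInnerUp": "inner brow raise",
    "eyeBlinkLeft": "left eye closure",
}

def intensity_label(value: float) -> str:
    """Map a coefficient in [0, 1] to a coarse intensity word."""
    if value >= 0.7:
        return "strong"
    if value >= 0.3:
        return "moderate"
    return "slight"

def coefficients_to_description(coeffs: dict, threshold: float = 0.2) -> str:
    """Build a structured text description of active facial actions,
    ordered from strongest to weakest coefficient."""
    active = [(name, v) for name, v in coeffs.items() if v >= threshold]
    active.sort(key=lambda item: item[1], reverse=True)
    phrases = [
        f"{intensity_label(v)} {ARKIT_SEMANTICS.get(name, name)} ({v:.2f})"
        for name, v in active
    ]
    return "; ".join(phrases) if phrases else "neutral face"

desc = coefficients_to_description(
    {"jawOpen": 0.85, "mouthSmileLeft": 0.4, "eyeBlinkLeft": 0.05}
)
print(desc)  # -> "strong mouth opening (0.85); moderate left-side smile (0.40)"
```

In the second stage, text of this form (paired with the source image) would serve as the distillation target, so the model learns to ground each coefficient in a muscle-level description rather than regress an opaque vector.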