KeyframeFace: From Text to Expressive Facial Keyframes

📅 2025-12-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two key challenges in generating highly expressive 3D facial animation from natural language: weak semantic understanding and a lack of temporal structure. To this end, we propose the first text-to-keyframe animation generation framework. Methodologically: (1) we introduce KeyframeFace, the first large-scale multimodal keyframe dataset, comprising 2,100 scripted sequences annotated with ARKit facial parameters and multi-view semantic labels produced by LLMs and multimodal LLMs (MLLMs); (2) we design a conditional diffusion model that jointly leverages large language model priors and interpretable ARKit coefficients, enabling keyframe-level semantic supervision and explicit temporal modeling. Experiments demonstrate significant improvements in semantic fidelity, motion coherence, and emotional expressiveness over prior methods. Both the code and the KeyframeFace dataset are publicly released.
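To make the second contribution concrete, below is a minimal sketch, not the authors' released code, of a conditional denoiser over keyframe-level ARKit coefficients. The 52-dimensional blendshape vector, the 768-dimensional pooled text embedding, the transformer backbone, and the toy noise schedule are all assumptions for illustration.

```python
# Minimal sketch (assumptions throughout): a denoiser that predicts the noise
# added to a sequence of ARKit keyframe coefficients, conditioned on a pooled
# LLM text embedding and a diffusion timestep.
import torch
import torch.nn as nn

class KeyframeDenoiser(nn.Module):
    def __init__(self, n_blendshapes=52, text_dim=768, hidden=256, n_layers=4):
        super().__init__()
        self.in_proj = nn.Linear(n_blendshapes, hidden)
        self.cond_proj = nn.Linear(text_dim, hidden)
        self.t_embed = nn.Sequential(
            nn.Linear(1, hidden), nn.SiLU(), nn.Linear(hidden, hidden))
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.out_proj = nn.Linear(hidden, n_blendshapes)

    def forward(self, noisy_keyframes, t, text_emb):
        # noisy_keyframes: (B, K, 52), t: (B,), text_emb: (B, 768)
        h = self.in_proj(noisy_keyframes)
        cond = self.cond_proj(text_emb) + self.t_embed(t[:, None].float())
        h = h + cond[:, None, :]      # broadcast the condition over keyframes
        h = self.backbone(h)          # temporal attention across keyframes
        return self.out_proj(h)       # predicted noise, (B, K, 52)

# One DDPM-style training step (epsilon prediction, toy cosine schedule):
model = KeyframeDenoiser()
x0 = torch.rand(8, 10, 52)            # 8 clips, 10 keyframes each
text = torch.randn(8, 768)            # stand-in for pooled LLM embeddings
t = torch.randint(0, 1000, (8,))
alpha_bar = torch.cos(t.float() / 1000 * torch.pi / 2) ** 2
eps = torch.randn_like(x0)
xt = alpha_bar[:, None, None].sqrt() * x0 \
     + (1 - alpha_bar)[:, None, None].sqrt() * eps
loss = nn.functional.mse_loss(model(xt, t, text), eps)
loss.backward()
```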

📝 Abstract
Generating dynamic 3D facial animation from natural language requires understanding both temporally structured semantics and fine-grained expression changes. Existing datasets and methods mainly focus on speech-driven animation or unstructured expression sequences and therefore lack the semantic grounding and temporal structure needed for expressive human performance generation. In this work, we introduce KeyframeFace, a large-scale multimodal dataset designed for text-to-animation research through keyframe-level supervision. KeyframeFace provides 2,100 expressive scripts paired with monocular videos, per-frame ARKit coefficients, contextual backgrounds, complex emotions, manually defined keyframes, and multi-perspective annotations derived from ARKit coefficients and images via Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs). Beyond the dataset, we propose the first text-to-animation framework that explicitly leverages LLM priors for interpretable facial motion synthesis. This design aligns the semantic understanding capabilities of LLMs with the interpretable structure of ARKit coefficients, enabling high-fidelity expressive animation. Together, KeyframeFace and our LLM-based framework establish a new foundation for interpretable, keyframe-guided, and context-aware text-to-animation. Code and data are available at https://github.com/wjc12345123/KeyframeFace.
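For orientation, here is a hypothetical sketch of what a single KeyframeFace record might contain, inferred from the abstract's description; the field names, types, and values are assumptions, not the released schema.

```python
# Hypothetical record layout (assumed, not the released schema), inferred
# from the modalities the abstract lists for each of the 2,100 scripts.
from dataclasses import dataclass

@dataclass
class KeyframeFaceSample:
    script: str                       # expressive natural-language script
    context: str                      # contextual background description
    emotion: str                      # complex emotion label
    video_path: str                   # monocular performance video
    arkit_coeffs: list[list[float]]   # per-frame ARKit coefficients (T x 52)
    keyframe_indices: list[int]       # manually defined keyframe positions
    llm_annotations: dict[str, str]   # multi-perspective LLM/MLLM labels

sample = KeyframeFaceSample(
    script="She reads the letter, and her smile slowly collapses into grief.",
    context="A quiet kitchen at dusk.",
    emotion="bittersweet",
    video_path="clips/0001.mp4",
    arkit_coeffs=[[0.0] * 52],        # placeholder single frame
    keyframe_indices=[0],
    llm_annotations={"expression": "smile fading to sorrow"},
)
```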
Problem

Research questions and friction points this paper is trying to address.

How to generate expressive 3D facial animation directly from natural-language text.
Existing datasets and methods lack the semantic grounding and temporal structure needed for expressive performance generation.
No prior dataset or framework leverages LLM priors for interpretable, keyframe-level motion synthesis.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale multimodal dataset with keyframe-level supervision
First text-to-animation framework using LLM priors
Aligns LLM semantics with interpretable ARKit coefficients (see the sketch after this list)
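A toy illustration of the last point: because ARKit coefficients are named, bounded blendshape weights, an LLM's structured description of a keyframe can be mapped onto an interpretable coefficient vector. The parsing step and the name subset below are assumptions, not the paper's pipeline.

```python
# Toy illustration (an assumption, not the paper's pipeline): a keyframe
# described in text maps onto named ARKit blendshape targets in [0, 1].
ARKIT_NAMES = ["browInnerUp", "mouthSmileLeft", "mouthSmileRight",
               "jawOpen", "eyeBlinkLeft", "eyeBlinkRight"]

def keyframe_to_vector(targets: dict[str, float]) -> list[float]:
    """Turn a named-blendshape dict (e.g., parsed from an LLM's structured
    output) into a dense coefficient vector over a fixed name ordering."""
    return [min(max(targets.get(name, 0.0), 0.0), 1.0) for name in ARKIT_NAMES]

# "A hesitant half-smile with raised inner brows" might be parsed as:
vec = keyframe_to_vector({"mouthSmileLeft": 0.4, "mouthSmileRight": 0.35,
                          "browInnerUp": 0.6})
print(vec)  # [0.6, 0.4, 0.35, 0.0, 0.0, 0.0]
```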
Jingchao Wu
Westlake University, Nanjing University
Zejian Kang
Zhejiang University, Westlake University
Haibo Liu
Westlake University
Yuanchen Fei
Westlake University, Hunan University
Xiangru Huang
Westlake University
Machine Learning and Optimization · Geometry Processing · Deep Learning