🤖 AI Summary
This work addresses two key challenges in generating highly expressive 3D facial animation from natural language: weak semantic understanding and a lack of temporal structure. To this end, we propose the first text-to-keyframe animation generation framework. The contributions are twofold: (1) we introduce KeyframeFace, the first large-scale multimodal keyframe dataset, comprising 2,100 scripted sequences annotated with ARKit facial parameters and multi-view semantic labels generated by large language models (LLMs) and multimodal LLMs (MLLMs); (2) we design a conditional diffusion model that jointly leverages LLM priors and interpretable ARKit coefficients, enabling keyframe-level semantic supervision and explicit temporal modeling. Experiments demonstrate significant improvements in semantic fidelity, motion coherence, and emotional expressiveness over prior methods. Both the code and the KeyframeFace dataset are publicly released.
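To make the setup concrete, the sketch below illustrates the general shape of a conditional diffusion process over keyframe sequences of ARKit blendshape coefficients. This is a hypothetical illustration, not the paper's implementation: the noise schedule, the keyframe count `K`, and the placeholder `text_cond` embedding are all assumptions; a real model would train a denoising network on `(x_t, t, text_cond)`.

```python
import numpy as np

# Hypothetical sketch (not the paper's code): a DDPM-style forward
# noising process over K keyframes, each a vector of 52 ARKit
# blendshape coefficients, with a text-conditioning vector that a
# trained denoiser network would consume.

T = 100                               # number of diffusion steps (assumed)
betas = np.linspace(1e-4, 0.02, T)    # linear noise schedule (assumed)
alphas_bar = np.cumprod(1.0 - betas)  # cumulative product \bar{alpha}_t

def q_sample(x0, t, noise):
    """Sample x_t ~ q(x_t | x_0) in closed form."""
    a = np.sqrt(alphas_bar[t])
    s = np.sqrt(1.0 - alphas_bar[t])
    return a * x0 + s * noise

rng = np.random.default_rng(0)
K, D = 8, 52                          # 8 keyframes x 52 ARKit coefficients
x0 = rng.uniform(0.0, 1.0, (K, D))    # clean keyframe animation (toy data)
text_cond = rng.normal(size=128)      # placeholder LLM text embedding

noise = rng.normal(size=x0.shape)
x_t = q_sample(x0, t=50, noise=noise)

# A trained model would predict `noise` from (x_t, t, text_cond);
# here we only check that the closed-form noising inverts correctly.
x0_rec = (x_t - np.sqrt(1.0 - alphas_bar[50]) * noise) / np.sqrt(alphas_bar[50])
assert np.allclose(x0_rec, x0)
```

The keyframe-level formulation means the diffusion model only has to generate a handful of semantically salient poses, with dense frames recoverable by interpolation of the interpretable coefficients.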
📝 Abstract
Generating dynamic 3D facial animation from natural language requires understanding both temporally structured semantics and fine-grained expression changes. Existing datasets and methods mainly focus on speech-driven animation or unstructured expression sequences, and therefore lack the semantic grounding and temporal structure needed for expressive human performance generation. In this work, we introduce KeyframeFace, a large-scale multimodal dataset designed for text-to-animation research through keyframe-level supervision. KeyframeFace provides 2,100 expressive scripts paired with monocular videos, per-frame ARKit coefficients, contextual backgrounds, complex emotions, manually defined keyframes, and multi-perspective annotations derived from ARKit coefficients and images via Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs). Beyond the dataset, we propose the first text-to-animation framework that explicitly leverages LLM priors for interpretable facial motion synthesis. This design aligns the semantic understanding capabilities of LLMs with the interpretable structure of ARKit coefficients, enabling high-fidelity expressive animation. Together, KeyframeFace and our LLM-based framework establish a new foundation for interpretable, keyframe-guided, and context-aware text-to-animation. Code and data are available at https://github.com/wjc12345123/KeyframeFace.