🤖 AI Summary
Existing facial expression generation datasets are constrained by speech-driven paradigms or coarse-grained emotion labels, lacking fine-grained semantic descriptions and relying on costly motion-capture systems. To address these limitations, we introduce the first lightweight benchmark for text-driven 4D facial animation: a high-fidelity dataset captured with consumer-grade RGB-D sensors, parameterized via ARKit blendshapes, and augmented with rich, nuanced natural language instructions automatically generated by large language models. This pairing enables many-to-many text-to-motion mappings. Building on this resource, we train and evaluate text-to-facial-motion generation models, demonstrating improvements in both semantic fidelity (accurately interpreting diverse linguistic descriptions) and expressive diversity. The dataset, training code, and pretrained models are fully open-sourced to foster reproducible research.
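For intuition, a sample in an ARKit-blendshape dataset can be thought of as a text prompt paired with a time series of per-frame blendshape weight vectors (ARKit exposes 52 face coefficients, each in [0, 1]). The sketch below is illustrative only; the field names, prompt, and frame rate are assumptions, not the paper's actual schema:

```python
import numpy as np

NUM_ARKIT_BLENDSHAPES = 52  # ARKit's face tracking exposes 52 blendshape coefficients

def make_sequence(num_frames: int, seed=None) -> dict:
    """Build a toy text-to-motion sample: a natural-language prompt paired
    with a (num_frames, 52) array of blendshape weights in [0, 1].
    Here the motion is random noise; a real sample would hold captured data."""
    rng = np.random.default_rng(seed)
    return {
        "text": "raise both eyebrows, then smile slowly",  # hypothetical prompt
        "motion": rng.uniform(0.0, 1.0, size=(num_frames, NUM_ARKIT_BLENDSHAPES)),
    }

sample = make_sequence(num_frames=120)  # e.g. ~2 seconds at 60 fps
print(sample["motion"].shape)  # (120, 52)
```

Because each row is a standard blendshape weight vector, such a sequence can directly drive any ARKit-compatible face rig.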
📝 Abstract
Dynamic facial expression generation from natural language is a crucial task in Computer Graphics, with applications in Animation, Virtual Avatars, and Human-Computer Interaction. However, current generative models suffer from datasets that are either speech-driven or limited to coarse emotion labels, lack the nuanced, expressive descriptions needed for fine-grained control, and were captured with elaborate and expensive equipment. We therefore present a new dataset of facial motion sequences featuring nuanced performances and semantic annotation. The data is collected with commodity equipment, guided by LLM-generated natural language instructions, and stored in the popular ARKit blendshape format. This yields riggable motion, rich with expressive performances and labels. We accordingly train two baseline models and evaluate their performance for future benchmarking. Trained on our Express4D dataset, the models learn meaningful text-to-expression motion generation and capture the many-to-many mapping between the two modalities. The dataset, code, and video examples are available on our webpage: https://jaron1990.github.io/Express4D/