ExFace: Expressive Facial Control for Humanoid Robots with Diffusion Transformers and Bootstrap Training

📅 2025-04-20
🤖 AI Summary
This work addresses the challenge of achieving high-precision, real-time mapping from human facial expressions to facial motor control in humanoid robots, aiming to enhance expression naturalness, motion fluency, and interactive responsiveness. We propose the first diffusion Transformer architecture specifically designed for robot facial control, integrating bootstrapped training and a novel blendshape-to-motor mapping model. Additionally, we introduce ExFace—the first benchmark dataset dedicated to face-driven robotic facial motion. Our method achieves state-of-the-art performance in expression reconstruction accuracy, real-time inference speed (>30 FPS), and end-to-end latency (<80 ms), significantly advancing real-time anthropomorphic expressivity. The framework has been successfully deployed on multiple humanoid robot platforms, enabling natural expressive performances and high-fidelity human–robot interaction.

📝 Abstract
This paper presents a novel Expressive Facial Control (ExFace) method based on Diffusion Transformers, which achieves precise mapping from human facial blendshapes to bionic robot motor control. By incorporating an innovative model bootstrap training strategy, our approach not only generates high-quality facial expressions but also significantly improves accuracy and smoothness. Experimental results demonstrate that the proposed method outperforms previous methods in terms of accuracy, frames per second (FPS), and response time. Furthermore, we develop the ExFace dataset driven by human facial data. ExFace shows excellent real-time performance and natural expression rendering in applications such as robot performances and human-robot interactions, offering a new solution for bionic robot interaction.
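
To make the blendshape-to-motor mapping problem concrete, the following is a minimal sketch of a naive baseline. It is not the paper's Diffusion Transformer. It assumes a fixed linear weight matrix, motor targets clamped to a normalized range, and exponential smoothing across frames. The function names, the weight values, the two-motor rig, and the blendshape meanings are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def blendshapes_to_motors(
    blendshapes: Sequence[float],
    weights: Sequence[Sequence[float]],
    lo: float = 0.0,
    hi: float = 1.0,
) -> List[float]:
    """Map one frame of blendshape coefficients to clamped motor targets."""
    motors = []
    for row in weights:  # one weight row per motor (assumed layout)
        value = sum(w * b for w, b in zip(row, blendshapes))
        motors.append(min(hi, max(lo, value)))  # keep within the motor range
    return motors


def smooth(frames: Sequence[Sequence[float]], alpha: float = 0.5) -> List[List[float]]:
    """Exponential moving average over consecutive motor frames."""
    out: List[List[float]] = []
    for frame in frames:
        if not out:
            out.append(list(frame))  # first frame passes through unchanged
        else:
            prev = out[-1]
            out.append([alpha * x + (1 - alpha) * p for x, p in zip(frame, prev)])
    return out


# Hypothetical rig: 2 motors driven by 3 blendshapes (jaw open, left smile, right smile).
WEIGHTS = [[1.0, 0.0, 0.0],
           [0.0, 0.5, 0.5]]
frames = [[0.0, 0.0, 0.0], [1.0, 1.0, 0.0], [2.0, 1.0, 1.0]]

motor_frames = [blendshapes_to_motors(f, WEIGHTS) for f in frames]
smoothed = smooth(motor_frames)
print(motor_frames)
print(smoothed)
```

A fixed linear map like this cannot capture nonlinear coupling between motors and skin deformation, which is one reason a learned model may be preferred for this task.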
Problem

Research questions and friction points this paper is trying to address.

Precise mapping from human to robot facial expressions
Improving expression accuracy and smoothness via bootstrap training
Enhancing real-time performance in human-robot interaction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion Transformers for facial control
Bootstrap training enhances expression quality
ExFace dataset from human facial data
Dong Zhang
School of Information Science and Technology, ShanghaiTech University, Shanghai 201210, China
Jingwei Peng
School of Information Science and Technology, ShanghaiTech University, Shanghai 201210, China
Yuyang Jiao
School of Information Science and Technology, ShanghaiTech University, Shanghai 201210, China
Jiayuan Gu
Assistant Professor, ShanghaiTech University
Embodied AI, 3D Vision
Jingyi Yu
Professor, ShanghaiTech University
Computer Vision, Computer Graphics
Jiahao Chen
School of Information Science and Technology, ShanghaiTech University, Shanghai 201210, China