🤖 AI Summary
This work proposes a real-time, high-fidelity facial expression shadowing system for humanoid robots to overcome the limitations of existing approaches, which often sacrifice either responsiveness or realism due to offline processing and insufficient detail transfer. The system leverages a novel cross-modal translation network, X2CNet++, combined with a feature-adaptive training strategy, significantly improving motion alignment accuracy from human faces to robotic facial actuators. A streaming video inference pipeline and an asynchronous I/O-driven communication mechanism are co-designed to enable efficient device coordination. The resulting framework achieves facial expression mapping within 50 milliseconds, demonstrates strong generalization across diverse robotic facial morphologies, and exhibits superior real-time performance, expressiveness, and practicality, as validated through extensive real-world experiments.
📝 Abstract
Humanoid facial expression shadowing enables robots to realistically imitate human facial expressions in real time, which is critical for lifelike, facially expressive humanoid robots and affective human-robot interaction. Existing progress in humanoid facial expression imitation remains limited, often failing to achieve either real-time performance or realistic expressiveness due to offline video-based inference designs and insufficient ability to capture and transfer subtle expression details. To address these limitations, we present VividFace, a real-time and realistic facial expression shadowing system for humanoid robots. An optimized imitation framework, X2CNet++, enhances expressiveness by fine-tuning the human-to-humanoid facial motion transfer module and introducing a feature-adaptive training strategy for better alignment across different image sources. Real-time shadowing is further enabled by a video-stream-compatible inference pipeline and a streamlined workflow based on asynchronous I/O for efficient communication across devices. VividFace produces vivid humanoid faces by mimicking human facial expressions within 0.05 seconds, while generalizing across diverse facial configurations. Extensive real-world demonstrations validate its practical utility. Videos are available at: https://lipzh5.github.io/VividFace/.
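The abstract's asynchronous, stream-compatible design can be illustrated with a minimal sketch: a producer pushes camera frames into a small bounded queue (dropping stale frames so latency stays bounded), while a consumer coroutine runs inference and ships actuator commands without blocking the capture loop. All names and the drop-oldest policy below are hypothetical illustrations, not the actual VividFace implementation.

```python
import asyncio

async def capture(frames, queue):
    """Producer: push frames; drop the oldest if inference lags,
    so the consumer always sees near-current frames (bounded latency)."""
    for frame in frames:
        if queue.full():
            queue.get_nowait()  # discard stale frame instead of queueing delay
        await queue.put(frame)
    await queue.put(None)  # sentinel: end of stream

async def shadow(queue, results):
    """Consumer: placeholder 'inference' per frame, then async hand-off
    (here just collected in a list) standing in for sending to the robot."""
    while True:
        frame = await queue.get()
        if frame is None:
            break
        command = f"cmd({frame})"  # stands in for X2CNet++ inference
        results.append(command)    # stands in for an async send to actuators

async def main(frames):
    queue = asyncio.Queue(maxsize=2)  # small buffer keeps end-to-end latency low
    results = []
    await asyncio.gather(capture(frames, queue), shadow(queue, results))
    return results

commands = asyncio.run(main(["f0", "f1", "f2"]))
```

With a fast producer and a `maxsize=2` buffer, the earliest frame is dropped rather than queued, which is the behavior a real-time shadowing loop wants: freshness over completeness.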