AI Summary
This work addresses the lack of non-invasive, calibration-free solutions for real-time facial expression capture in current virtual reality systems. The authors propose a distillation-based framework for training on heterogeneous data that leverages egocentric facial images captured by infrared cameras built into VR headsets. By fusing synthetic and real multi-source data and integrating a lightweight capture system with a differentiable rendering pipeline, the method achieves high-fidelity, calibration-free facial animation. The study introduces a large-scale facial dataset covering 18,000 diverse subjects and demonstrates low-latency real-time performance under a mobile-VR collaborative architecture, making it suitable for applications such as video conferencing, gaming, and remote collaboration.
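To make the distillation idea concrete, here is a minimal sketch of training a single student model on heterogeneous supervision: exact labels for synthetic images, and soft labels produced by a frozen teacher for unlabeled real images. Everything in this sketch is illustrative rather than taken from the paper: the PyTorch setup, the tiny MLP architectures, the 52-dimensional expression vector, and the random tensors standing in for infrared camera crops are all assumptions.

```python
# Minimal sketch of distillation-based training on heterogeneous data.
# Assumptions (not from the paper): PyTorch, toy MLPs, a hypothetical
# 52-dim expression (blendshape) target, and random stand-in images.
import torch
import torch.nn as nn

EXPR_DIM = 52  # hypothetical number of expression coefficients

student = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 256), nn.ReLU(),
                        nn.Linear(256, EXPR_DIM))
teacher = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 256), nn.ReLU(),
                        nn.Linear(256, EXPR_DIM))
teacher.eval()  # frozen teacher, e.g. pre-trained on labeled synthetic data

opt = torch.optim.Adam(student.parameters(), lr=1e-4)
mse = nn.MSELoss()

def train_step(synthetic_batch, real_batch):
    """One step mixing ground-truth synthetic labels with teacher soft labels."""
    syn_img, syn_label = synthetic_batch   # synthetic data: exact labels exist
    real_img = real_batch                  # real data: no ground-truth labels
    with torch.no_grad():
        pseudo_label = teacher(real_img)   # distilled supervision for real images
    loss = mse(student(syn_img), syn_label) + mse(student(real_img), pseudo_label)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Toy usage: random tensors standing in for egocentric infrared crops.
syn = (torch.randn(8, 1, 64, 64), torch.rand(8, EXPR_DIM))
real = torch.randn(8, 1, 64, 64)
print(train_step(syn, real))
```

In a real system the networks would be convolutional and the batches would mix several label sources, but the loss structure, supervised term plus distillation term, is the essential point.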
Abstract
We present a novel system for real-time tracking of facial expressions using egocentric views captured from a set of infrared cameras embedded in a virtual reality (VR) headset. Our technology enables any user to accurately drive the facial expressions of virtual characters in a non-intrusive manner and without the need for a lengthy calibration step. At the core of our system is a distillation-based approach to train a machine learning model on heterogeneous data and labels coming from multiple sources, e.g., synthetic and real images. As part of our dataset, we collected data from 18k diverse subjects using a lightweight capture setup consisting of a mobile phone and a custom VR headset with extra cameras. To process this data, we developed a robust differentiable rendering pipeline enabling us to automatically extract facial expression labels. Our system opens up new avenues for communication and expression in virtual environments, with applications in video conferencing, gaming, entertainment, and remote collaboration.
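The abstract mentions a differentiable rendering pipeline used to extract expression labels automatically. The sketch below illustrates the underlying analysis-by-synthesis idea: optimize expression coefficients by gradient descent through a differentiable image-formation model until the rendering matches the observed image. The paper's actual renderer is not described here, so a toy linear blendshape-to-image model stands in for it; the dimensions, the Adam optimizer, and the photometric L2 loss are all assumptions made for illustration.

```python
# Minimal sketch of extracting expression labels via inverse rendering.
# Assumption: a toy linear blendshape-to-image model replaces the real
# differentiable renderer so the example runs standalone.
import torch

EXPR_DIM, IMG_PIXELS = 52, 64 * 64

# Toy stand-in for a differentiable renderer: image = basis @ expression.
basis = torch.randn(IMG_PIXELS, EXPR_DIM)

def render(expr):
    return basis @ expr  # differentiable w.r.t. the expression coefficients

# "Observed" image produced by an unknown expression we want to recover.
true_expr = torch.rand(EXPR_DIM)
observed = render(true_expr)

expr = torch.zeros(EXPR_DIM, requires_grad=True)  # the label being fitted
opt = torch.optim.Adam([expr], lr=0.05)
for _ in range(500):
    loss = ((render(expr) - observed) ** 2).mean()  # photometric loss
    opt.zero_grad()
    loss.backward()
    opt.step()

print("max recovery error:", (expr - true_expr).abs().max().item())
```

Labels fitted this way can then serve as the ground truth for the synthetic branch of the distillation training sketched above.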