AI Summary
Realistic facial avatar animation in VR/AR is hindered by the scarcity of ground-truth facial state annotations: head-mounted infrared cameras (HMCs) offer only partial observability, while external dome cameras provide full-coverage ground truth but cannot be synchronized with HMCs, making paired data acquisition costly and poorly generalizable. This paper introduces GenHMC, the first generative method for synthesizing high-fidelity HMC images from unpaired data. GenHMC disentangles facial expression and viewpoint from identity appearance, conditioning synthesis on avatar states derived from full-coverage dome captures to produce photorealistic infrared HMC views. It requires no subject-aligned paired capture, enabling cross-identity generalization and robustness to varying viewpoints and lighting. Experiments show that GenHMC substantially improves data efficiency and ground-truth accuracy; downstream face encoders trained on its synthetic data achieve state-of-the-art performance across multiple benchmarks.
Abstract
Enabling photorealistic avatar animations in virtual and augmented reality (VR/AR) has been challenging because of the difficulty of obtaining ground-truth facial states. It is physically impossible to simultaneously capture synchronized images from head-mounted cameras (HMCs), which provide partial observations in infrared (IR), and from an array of outside-in dome cameras, which provide the full observations that match avatars' appearance. Prior works relying on analysis-by-synthesis can generate accurate ground truth, but suffer from imperfect disentanglement between expression and style in their personalized training. Their reliance on extensive paired captures (HMC and dome) of the same subject makes large-scale data collection operationally expensive, and the resulting data cannot be reused for different HMC viewpoints and lighting. In this work, we propose a novel generative approach, Generative HMC (GenHMC), that leverages large unpaired HMC captures, which are much easier to collect, to directly generate high-quality synthetic HMC images given any conditioning avatar state from dome captures. We show that our method properly disentangles the conditioning signal, which specifies facial expression and viewpoint, from facial appearance, leading to more accurate ground truth. Furthermore, our method generalizes to unseen identities, removing the reliance on paired captures. We demonstrate these breakthroughs by evaluating both the synthetic HMC images and the universal face encoders trained on the resulting HMC-avatar correspondences, which achieve better data efficiency and state-of-the-art accuracy.
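The abstract's data-generation idea can be sketched as a simple pipeline: condition a generator on an avatar state (expression plus HMC viewpoint) from a dome capture, sample an identity appearance independently, and collect the resulting (synthetic HMC image, known avatar state) pairs as supervision for a face encoder. The sketch below is purely illustrative, assuming hypothetical names (`AvatarState`, `generate_hmc_image`, `build_training_pairs`) that do not appear in the paper; real pixels and models are replaced by metadata stubs.

```python
# Hypothetical sketch of the GenHMC-style data-generation pipeline described
# in the abstract. All names are illustrative assumptions, not the paper's API.
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass(frozen=True)
class AvatarState:
    """Conditioning signal from a dome capture: expression code + HMC viewpoint."""
    expression: Tuple[float, ...]   # e.g. latent expression coefficients
    viewpoint: str                  # e.g. which head-mounted IR camera

def generate_hmc_image(state: AvatarState, identity_seed: int) -> Dict:
    """Stand-in for the conditional generator: maps an avatar state and a
    sampled identity appearance to a synthetic IR HMC image. Returns metadata
    instead of pixels; the point is that expression/viewpoint are copied from
    the conditioning signal while appearance varies with the identity."""
    return {
        "viewpoint": state.viewpoint,
        "expression": state.expression,
        "identity": identity_seed,
    }

def build_training_pairs(states: List[AvatarState],
                         identities: List[int]) -> List[Tuple[Dict, AvatarState]]:
    """Create (synthetic HMC image, ground-truth avatar state) pairs — the
    correspondences a downstream universal face encoder would train on."""
    return [(generate_hmc_image(s, i), s) for s in states for i in identities]

pairs = build_training_pairs(
    [AvatarState(expression=(0.1, 0.9), viewpoint="left_eye_cam")],
    identities=[0, 1],
)
```

In this toy version, both pairs share the conditioned expression while their identity labels differ, mirroring the expression/appearance disentanglement the abstract claims; unpaired HMC captures would be used to train the generator itself, so no subject ever needs simultaneous HMC and dome recording.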