🤖 AI Summary
To address the reliance of high-fidelity, relightable, and editable 3D head reconstruction on costly multi-camera systems, this paper proposes a lightweight alternative based on a single smartphone, requiring only polarizing filters, a point light source, and a darkroom for dynamic facial video capture. Methodologically, the paper introduces a polarization-based separation of the skin's diffuse and specular reflectance; a hybrid representation that embeds 2D Gaussians in UV space; and a neural analysis-by-synthesis framework that explicitly decouples geometric deformation from appearance. Key technical components include cross- and parallel-polarized video acquisition, parametric UV mapping, differentiable ray tracing, environment-map estimation, and a newly curated multi-subject facial-motion dataset. Experiments demonstrate geometric and material fidelity comparable to a Light Stage on real smartphone-captured data, enabling real-time rendering, arbitrary relighting, and pose- or expression-driven animation.
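As background for the polarization-based separation (a standard result in polarized appearance capture, not code released with the paper): single-bounce specular reflection preserves the light's polarization, so a cross-polarized camera filter blocks it, while the depolarized diffuse term passes either filter orientation at roughly half strength. A minimal sketch, with hypothetical array names:

```python
import numpy as np

def separate_reflectance(parallel: np.ndarray, cross: np.ndarray):
    """Estimate diffuse and specular components from polarized frames.

    Inputs are linear-RGB frames of shape (H, W, 3), assuming
      parallel ~ specular + 0.5 * diffuse   (specular preserves polarization)
      cross    ~ 0.5 * diffuse              (specular blocked by the filter)
    """
    diffuse = 2.0 * cross                             # recover full diffuse term
    specular = np.clip(parallel - cross, 0.0, None)   # clamp sensor-noise negatives
    return diffuse, specular
```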
📝 Abstract
Creating photorealistic, animatable, and relightable 3D head avatars traditionally requires an expensive Light Stage with multiple calibrated cameras, making it inaccessible for widespread adoption. To bridge this gap, we present a novel, cost-effective approach for creating high-quality relightable head avatars using only a smartphone equipped with polarizing filters. Our approach simultaneously captures cross-polarized and parallel-polarized video streams in a dark room with a single point light source, separating the skin's diffuse and specular components during dynamic facial performances. We introduce a hybrid representation that embeds 2D Gaussians in the UV space of a parametric head model, enabling efficient real-time rendering while preserving high-fidelity geometric detail. Our learning-based neural analysis-by-synthesis pipeline decouples pose- and expression-dependent geometric offsets from appearance, decomposing the surface into albedo, normal, and specular UV texture maps along with environment maps. We also collect a unique dataset of subjects performing diverse facial expressions and head movements.
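To make the hybrid representation concrete, the sketch below (an assumed structure, not the paper's implementation) shows how 2D Gaussians anchored in UV coordinates can follow a deforming parametric head mesh: each Gaussian stores its position in texture space, and a UV-to-triangle lookup plus barycentric interpolation places it on the current posed, expressive surface. `uv_to_face` and all field names are hypothetical.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class UVGaussians:
    """2D Gaussians parameterized in the UV space of a head mesh.

    Anchoring splats in UV coordinates (rather than world space) lets them
    ride on the deforming surface: re-evaluating the UV-to-3D mapping for a
    new expression or pose moves every Gaussian consistently.
    """
    uv: np.ndarray        # (N, 2) positions in [0, 1]^2 texture space
    scale: np.ndarray     # (N, 2) tangent-plane extents
    rotation: np.ndarray  # (N,)   in-plane rotation angles
    opacity: np.ndarray   # (N,)

def splat_positions(g: UVGaussians, mesh_vertices, faces, uv_to_face):
    """Map each Gaussian's UV anchor to a 3D point on the current mesh.

    `uv_to_face` is an assumed lookup returning, for each UV coordinate, the
    containing triangle index and barycentric weights; the 3D anchor is the
    barycentric blend of that triangle's (deformed) vertex positions.
    """
    face_idx, bary = uv_to_face(g.uv)              # (N,), (N, 3)
    tri = mesh_vertices[faces[face_idx]]           # (N, 3, 3) triangle corners
    return np.einsum('nk,nkd->nd', bary, tri)      # (N, 3) world positions
```

Because only `mesh_vertices` changes between frames, the same UV anchors can be re-projected onto every deformed mesh, which is what allows the representation to combine mesh-driven animation with Gaussian-splat rendering.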