WildCap: Facial Appearance Capture in the Wild via Hybrid Inverse Rendering

📅 2025-12-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing facial appearance capture methods rely on controlled illumination, which limits practicality and raises deployment cost. This paper proposes the first high-fidelity facial appearance reconstruction framework tailored to smartphone videos captured under natural illumination. Methodologically: (1) a hybrid inverse rendering architecture integrates data-driven preprocessing with model-driven optimization; (2) texel-level, physically grounded illumination modeling mitigates neural artifacts and alleviates the albedo-illumination scale ambiguity; (3) geometry, reflectance, and illumination are optimized jointly by coupling diffusion-based priors with illumination estimation. The method significantly outperforms state-of-the-art approaches on in-the-wild videos, achieving detail fidelity and illumination robustness comparable to those attained under controlled lighting, thereby closing much of the performance gap between natural and controlled illumination settings.

📝 Abstract
Existing methods achieve high-quality facial appearance capture under controllable lighting, which increases capture cost and limits usability. We propose WildCap, a novel method for high-quality facial appearance capture from a smartphone video recorded in the wild. To disentangle high-quality reflectance from complex lighting effects in in-the-wild captures, we propose a novel hybrid inverse rendering framework. Specifically, we first apply a data-driven method, i.e., SwitchLight, to convert the captured images into more constrained conditions and then adopt model-based inverse rendering. However, unavoidable local artifacts in network predictions, such as shadow-baking, are non-physical and thus hinder accurate inverse rendering of lighting and material. To address this, we propose a novel texel grid lighting model to explain non-physical effects as clean albedo illuminated by local physical lighting. During optimization, we jointly sample a diffusion prior for reflectance maps and optimize the lighting, effectively resolving the scale ambiguity between local lights and albedo. Our method achieves significantly better results than prior art in the same capture setup, closing the quality gap between in-the-wild and controllable recordings by a large margin. Our code will be released at https://yxuhan.github.io/WildCap/index.html.
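The scale ambiguity mentioned in the abstract can be made concrete with a toy Lambertian example: since a pixel is the product of albedo and shading, scaling the albedo up and the lighting down by the same factor reproduces the image exactly, so inverse rendering alone cannot fix the absolute scale. The sketch below (hypothetical code, not the paper's implementation; `fit_scale_with_prior` and the mean-albedo constraint are stand-ins for WildCap's diffusion-based prior) illustrates how a prior on the reflectance resolves this:

```python
# Toy illustration of the albedo-illumination scale ambiguity.
# Under a Lambertian model, pixel = albedo * shading, so for any k > 0
# the pair (k * albedo, shading / k) renders the identical image.
import numpy as np

rng = np.random.default_rng(0)
albedo = rng.uniform(0.2, 0.8, size=(4, 4))   # ground-truth albedo texels
shading = rng.uniform(0.5, 1.5, size=(4, 4))  # local (texel-grid) lighting
image = albedo * shading                       # observed pixels

# The rescaled pair explains the observation equally well.
k = 2.0
assert np.allclose((k * albedo) * (shading / k), image)

def fit_scale_with_prior(image, shading_est, albedo_prior_mean):
    """Fix the global scale of the albedo estimate using a prior on the
    mean albedo (a simple stand-in for a learned reflectance prior)."""
    albedo_est = image / shading_est           # albedo up to unknown scale
    s = albedo_prior_mean / albedo_est.mean()  # prior pins down the scale
    return s * albedo_est, shading_est / s

# Start from a wrongly scaled lighting estimate; the prior restores it.
albedo_rec, shading_rec = fit_scale_with_prior(image, shading / k, albedo.mean())
assert np.allclose(albedo_rec, albedo)
assert np.allclose(albedo_rec * shading_rec, image)
```

In WildCap the prior is far richer (diffusion sampling over full reflectance maps, optimized jointly with the texel-grid lighting), but the underlying reason a prior is needed at all is this same multiplicative degeneracy.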
Problem

Research questions and friction points this paper is trying to address.

Captures high-quality facial appearance from smartphone videos in uncontrolled environments.
Disentangles reflectance from complex lighting effects via hybrid inverse rendering.
Resolves artifacts and scale ambiguity between lighting and albedo during optimization.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid inverse rendering framework for facial capture
Texel grid lighting model for non-physical artifacts
Joint diffusion prior sampling for reflectance optimization
Yuxuan Han
Tsinghua University
computer vision, computer graphics
Xin Ming
School of Software and BNRist, Tsinghua University
Tianxiao Li
School of Software and BNRist, Tsinghua University
Zhuofan Shen
School of Software and BNRist, Tsinghua University
Qixuan Zhang
ShanghaiTech University
Lan Xu
ShanghaiTech University
Feng Xu
School of Software and BNRist, Tsinghua University