🤖 AI Summary
This work addresses the problem of disentangling facial reflectance from unknown illumination in casually captured, in-the-wild videos. Instead of relying on model-based inverse rendering alone, we train a delighting network that serves as a prior to constrain the optimization, and we introduce Dataset Latent Modulation (DLM) to combine heterogeneous training data. DLM conditions the core network on learnable source-aware tokens, which decouples dataset-specific styles from the underlying physical delighting principles. The model is trained on an OLAT dataset and on rendered Light Stage scans. The resulting prior outperforms existing proprietary delighting models, and the appearance capture pipeline built on it outperforms prior work on reflectance estimation from casual video by a large margin. Additionally, we release NeRSemble-Scan, a large-scale collection of 4K-resolution relightable facial scans derived from the multi-view NeRSemble dataset, to support research on digital human avatars.
📝 Abstract
High-quality facial appearance capture has traditionally required costly studio recording. Recent work considers an in-the-wild, smartphone-based setup; however, its model-based inverse rendering paradigm struggles with the complex disentanglement of reflectance from unknown illumination. To bridge this gap, we propose shifting the paradigm to training a powerful delighting network that serves as a prior to constrain the optimization. We train on an OLAT (one-light-at-a-time) dataset and on rendered Light Stage scans, and propose Dataset Latent Modulation (DLM) to integrate these heterogeneous data sources. Specifically, by conditioning the core network on learnable source-aware tokens, we decouple dataset-specific styles from physical delighting principles, which yields a delighting prior that outperforms existing proprietary models. This prior enables a simple and automatic appearance capture pipeline that achieves high-quality reflectance estimation from casual video inputs, outperforming prior art by a large margin. Furthermore, we use our appearance capture method to transform the multi-view NeRSemble dataset into NeRSemble-Scan, a large-scale collection of 4K-resolution relightable scans. By open-sourcing our model and the NeRSemble-Scan dataset, we aim to democratize high-end facial capture and provide a new foundation for the research community to build photorealistic digital humans.
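The abstract describes DLM only as conditioning the core network on learnable source-aware tokens, so the following is a minimal sketch of one way such conditioning could look, not the authors' implementation. It assumes a FiLM-style scale-and-shift modulation, which the abstract does not confirm. The class name, the source labels `olat` and `light_stage_render`, and all dimensions are hypothetical, and plain Python lists stand in for network tensors.

```python
import random


def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


class DatasetLatentModulation:
    """Hypothetical sketch: one learnable token per training data source.

    Assumption: the token modulates shared features with a per-channel
    scale and shift (FiLM-style). The abstract does not specify this.
    """

    def __init__(self, sources, token_dim, feature_dim, seed=0):
        rng = random.Random(seed)
        self.source_index = {name: i for i, name in enumerate(sources)}
        # One token per source; in a real model these would be trained.
        self.tokens = [
            [rng.gauss(0.0, 0.02) for _ in range(token_dim)] for _ in sources
        ]
        # Linear maps from token to scale and shift, zero-initialized so
        # the modulation starts out as the identity.
        self.w_scale = [[0.0] * token_dim for _ in range(feature_dim)]
        self.w_shift = [[0.0] * token_dim for _ in range(feature_dim)]

    def modulate(self, features, source):
        """Apply the source-specific scale and shift to shared features."""
        token = self.tokens[self.source_index[source]]
        scale = [1.0 + dot(row, token) for row in self.w_scale]
        shift = [dot(row, token) for row in self.w_shift]
        return [s * f + b for s, f, b in zip(scale, features, shift)]


dlm = DatasetLatentModulation(
    ["olat", "light_stage_render"], token_dim=4, feature_dim=3
)
features = [0.5, -1.0, 2.0]

# Before any training, the modulation leaves the features unchanged.
identity_out = dlm.modulate(features, "olat")

# Stand-in for training: give the shift map non-zero weights by hand.
dlm.w_shift = [[1.0] * 4 for _ in range(3)]
olat_out = dlm.modulate(features, "olat")
render_out = dlm.modulate(features, "light_stage_render")
```

In this sketch the shared weights would carry the source-independent delighting behavior, while each token absorbs the style of its own dataset. Which token, if any, is used at inference time is not stated in the abstract.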