🤖 AI Summary
To address low intrinsic-decomposition accuracy and limited rendering fidelity for facial images under unconstrained illumination, this paper proposes MAGINet, a multi-scale attention-guided network that achieves, for the first time, physically consistent joint prediction of six intrinsic channels: diffuse albedo, ambient occlusion, surface normals, specular reflectance, translucency, and raw diffuse colour. The method combines hierarchical residual encoding, a spatial-channel joint attention bottleneck, and adaptive multi-scale decoding, augmented by a lightweight RefinementNet that operates coarse-to-fine at high resolution (1024×1024). Conditioned on the refined albedo, a Pix2PixHD-based translator predicts the remaining passes, and the full pipeline is trained end-to-end with a composite loss comprising masked-MSE, VGG perceptual, edge-aware, and patch-based LPIPS terms. On FFHQ-UV-Intrinsics, MAGINet achieves state-of-the-art performance in diffuse albedo estimation and significantly outperforms prior methods in overall six-channel rendering quality, enabling high-fidelity relighting and material editing of faces.
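The coarse-to-fine flow above can be sketched as a short pipeline. This is a minimal illustration, not the authors' implementation: the three network callables (`maginet`, `refinement_net`, `translator`) are hypothetical stand-ins for the learned models, the residual formulation of the refinement step is an assumption, and nearest-neighbour upsampling stands in for whatever interpolation the real pipeline uses.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling (512 -> 1024); a real pipeline
    would more likely use bilinear interpolation."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

def intrinsics_pipeline(rgb, maginet, refinement_net, translator):
    """Coarse-to-fine sketch of the described pipeline.

    All three callables are hypothetical placeholders:
      maginet        -- predicts a coarse light-normalized albedo map
      refinement_net -- lightweight CNN; modelled here as predicting a
                        residual correction (an assumption)
      translator     -- Pix2PixHD-style network returning the remaining
                        passes as a dict keyed by pass name
    """
    coarse_albedo = maginet(rgb)                          # coarse albedo
    hi = upsample2x(coarse_albedo)                        # 2x resolution
    albedo = np.clip(hi + refinement_net(hi), 0.0, 1.0)  # refined albedo
    passes = translator(albedo)                          # additional passes
    passes["albedo"] = albedo
    return passes
```

With identity/zero dummies for the networks, the function simply upsamples the input and attaches it alongside the translator's output, which is enough to check the data flow.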
📄 Abstract
Accurate intrinsic decomposition of face images under unconstrained lighting is a prerequisite for photorealistic relighting, high-fidelity digital doubles, and augmented-reality effects. This paper introduces MAGINet, a Multi-scale Attention-Guided Intrinsics Network that predicts a $512\times512$ light-normalized diffuse albedo map from a single RGB portrait. MAGINet employs hierarchical residual encoding, spatial-and-channel attention in a bottleneck, and adaptive multi-scale feature fusion in the decoder, yielding sharper albedo boundaries and stronger lighting invariance than prior U-Net variants. The initial albedo prediction is upsampled to $1024\times1024$ and refined by a lightweight three-layer CNN (RefinementNet). Conditioned on this refined albedo, a Pix2PixHD-based translator then predicts a comprehensive set of five additional physically based rendering passes: ambient occlusion, surface normal, specular reflectance, translucency, and raw diffuse colour (with residual lighting). Together with the refined albedo, these six passes form the complete intrinsic decomposition. Trained with a combination of masked-MSE, VGG, edge, and patch-LPIPS losses on the FFHQ-UV-Intrinsics dataset, the full pipeline achieves state-of-the-art performance for diffuse albedo estimation and demonstrates significantly improved fidelity for the complete rendering stack compared to prior methods. The resulting passes enable high-quality relighting and material editing of real faces.
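The masked-MSE and edge terms of the composite loss can be sketched directly; the VGG-perceptual and patch-LPIPS terms require pretrained networks and are omitted here. This is an illustrative sketch under stated assumptions: the exact gradient operator, masking convention, and loss weights (`w_mse`, `w_edge`) are not specified by the abstract, and the values below are placeholders. Inputs are single-channel `H x W` arrays for simplicity.

```python
import numpy as np

def masked_mse(pred, target, mask):
    """MSE restricted to valid pixels (e.g., the face region).

    pred, target, mask: 2D float arrays of identical shape;
    mask entries are 1.0 where the pixel counts, 0.0 elsewhere.
    """
    diff = (pred - target) ** 2 * mask
    return diff.sum() / np.maximum(mask.sum(), 1.0)

def edge_loss(pred, target, mask):
    """L1 difference of horizontal/vertical finite-difference gradients,
    encouraging sharp albedo boundaries. The finite-difference operator
    is an assumption; the paper may use a different edge measure."""
    gx = np.abs(np.diff(pred, axis=1) - np.diff(target, axis=1))
    gy = np.abs(np.diff(pred, axis=0) - np.diff(target, axis=0))
    # Crop the mask to match the gradient shapes.
    return (gx * mask[:, 1:]).mean() + (gy * mask[1:, :]).mean()

def composite_loss(pred, target, mask, w_mse=1.0, w_edge=0.5):
    """Sketch of the composite objective. The VGG-perceptual and
    patch-LPIPS terms need pretrained feature extractors and are left
    out; the weights are illustrative assumptions, not paper values."""
    return w_mse * masked_mse(pred, target, mask) + w_edge * edge_loss(pred, target, mask)
```

Note that a constant intensity shift between `pred` and `target` is penalized only by the MSE term, since finite-difference gradients cancel the shift; this is precisely why an edge term on gradients complements a pixel-wise loss.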