🤖 AI Summary
Monocular 3D clothed human reconstruction is often hindered by scarce texture data, inaccurate geometric priors, and the supervision bias inherent to single-modality learning, leading to suboptimal reconstruction quality. To address these limitations, this work proposes a geometry-texture collaborative reconstruction framework. We construct a large-scale dataset comprising over 15,000 textured 3D human scans and introduce a multi-source texture synthesis strategy, a region-aware shape extraction module, and a Fourier-based geometric encoding mechanism. A dual-branch U-Net architecture is further designed to effectively fuse geometry and texture features. By transcending the constraints of single-modality supervision, our method achieves state-of-the-art performance across multiple benchmarks and on in-the-wild images, enabling high-fidelity 3D reconstruction of clothed humans from a single image.
📝 Abstract
Monocular 3D clothed human reconstruction aims to generate a complete and realistic textured 3D avatar from a single image. Existing methods are commonly trained under multi-view supervision with annotated geometric priors; at inference time, these priors are estimated from the monocular input by a pre-trained network. Such methods are constrained by three key limitations: texturally, by the unavailability of training data; geometrically, by inaccurate external priors; and systematically, by biased single-modality supervision, all of which lead to suboptimal reconstructions. To address these issues, we propose a novel reconstruction framework, named MultiGO++, which achieves effective, systematic geometry-texture collaboration. It consists of three core parts: (1) a multi-source texture synthesis strategy that constructs 15,000+ textured 3D human scans to improve texture estimation quality in challenging scenarios; (2) a region-aware shape extraction module that extracts features from each body region and models their interactions to obtain geometry information, together with a Fourier geometry encoder that mitigates the modality gap for effective geometry learning; (3) a dual reconstruction U-Net that leverages geometry-texture collaborative features to refine and generate high-fidelity textured 3D human meshes. Extensive experiments on two benchmarks and many in-the-wild cases show the superiority of our method over state-of-the-art approaches.
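To give a concrete sense of the Fourier geometric encoding idea mentioned above, the sketch below shows a common formulation: each 3D coordinate is mapped through sinusoids at geometrically spaced frequencies, lifting low-dimensional geometry into a feature space that is easier for a network to consume alongside image features. This is a hypothetical illustration; the function name, band count, and exact formulation are assumptions, not the paper's implementation.

```python
import numpy as np

def fourier_encode(points, num_bands=6):
    """Lift 3D coordinates into a Fourier feature space.

    For each coordinate p and frequency 2^k * pi (k = 0..num_bands-1),
    emit sin and cos features. Hypothetical sketch; MultiGO++'s actual
    encoder may use different frequencies or normalization.

    points: (N, 3) array of xyz coordinates.
    Returns: (N, 3 * 2 * num_bands) encoded features.
    """
    freqs = (2.0 ** np.arange(num_bands)) * np.pi    # (num_bands,)
    scaled = points[:, :, None] * freqs              # (N, 3, num_bands)
    enc = np.concatenate([np.sin(scaled), np.cos(scaled)], axis=-1)
    return enc.reshape(points.shape[0], -1)

pts = np.random.rand(4, 3)                           # 4 sample surface points
feat = fourier_encode(pts)
print(feat.shape)                                    # (4, 36)
```

Encodings like this are widely used (e.g. in NeRF-style positional encoding) to bridge the gap between low-dimensional geometric inputs and high-dimensional learned features, which is one plausible reading of how the encoder "mitigates the modality gap."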