🤖 AI Summary
Monocular 3D human reconstruction has long suffered from reliance on explicit intermediate geometric representations (e.g., SMPL parameters or voxels), resulting in incomplete end-to-end modeling, low geometric fidelity, and poor robustness to occlusion and complex clothing. This paper proposes the first end-to-end implicit reconstruction framework that requires no explicit geometric priors. We design an anatomy-aware implicit shape extraction module and introduce a dual-modality U-Net architecture to directly map RGB images to joint signed distance function (SDF) and neural radiance field (NeRF) representations. Additionally, we propose manga-style data augmentation and release a large-scale 3D human dataset comprising over 15,000 high-quality scans. Extensive experiments on multiple benchmarks and in-the-wild scenes demonstrate significant improvements over state-of-the-art methods: +23.6% in geometric detail fidelity and +31.4% in pose robustness—achieving, for the first time, high-fidelity, temporally consistent reconstructions under severe occlusion and complex clothing.
📝 Abstract
Monocular 3D clothed human reconstruction aims to create a complete 3D avatar from a single image. To tackle the human geometry lacking in one RGB image, current methods typically resort to a preceding model for an explicit geometric representation. For the reconstruction itself, focus is on modeling both it and the input image. This routine is constrained by the preceding model, and overlooks the integrity of the reconstruction task. To address this, this paper introduces a novel paradigm that treats human reconstruction as a holistic process, utilizing an end-to-end network for direct prediction from 2D image to 3D avatar, eliminating any explicit intermediate geometry display. Based on this, we further propose a novel reconstruction framework consisting of two core components: the Anatomy Shaping Extraction module, which captures implicit shape features taking into account the specialty of human anatomy, and the Twins Negotiating Reconstruction U-Net, which enhances reconstruction through feature interaction between two U-Nets of different modalities. Moreover, we propose a Comic Data Augmentation strategy and construct 15k+ 3D human scans to bolster model performance in more complex case input. Extensive experiments on two test sets and many in-the-wild cases show the superiority of our method over SOTA methods. Our demos can be found in : https://e2e3dgsrecon.github.io/e2e3dgsrecon/.