🤖 AI Summary
This work addresses the challenge of jointly modeling single-image 3D human reconstruction and 3D semantic segmentation. We propose a unified feed-forward framework that synergistically models appearance and part-level semantics by fusing geometric priors with self-supervised semantic priors. A pixel-aligned feature aggregation mechanism enhances cross-task consistency, while an interactive annotation strategy generates high-quality 3D semantic ground truth, alleviating the scarcity of labeled 3D human data. Our approach integrates generative modeling, multi-task joint optimization, and self-supervised learning without requiring large-scale annotated 3D human datasets. Evaluated on standard benchmarks, our method achieves state-of-the-art performance in both 3D reconstruction (measured by texture fidelity) and 3D semantic segmentation (measured by segmentation accuracy), demonstrating significant improvements in geometric-semantic consistency.
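To make the pixel-aligned aggregation concrete, below is a minimal PyTorch sketch of the general technique: 3D query points are projected into the image plane with the camera intrinsics, and 2D backbone features are bilinearly sampled at those projections. The `pixel_aligned_features` helper, the projection convention, and all tensor shapes are illustrative assumptions, not the paper's exact design.

```python
# Minimal sketch of pixel-aligned feature aggregation (assumed design,
# not the paper's published implementation).
import torch
import torch.nn.functional as F

def pixel_aligned_features(feat_map, points, K):
    """Sample 2D features at the projections of 3D points.

    feat_map: (B, C, H, W) image feature map (e.g., from a CNN/ViT backbone)
    points:   (B, N, 3) 3D query points in camera coordinates
    K:        (B, 3, 3) camera intrinsics
    returns:  (B, N, C) per-point features
    """
    # Project points to pixel coordinates: uv = (K @ x) / z.
    proj = torch.einsum('bij,bnj->bni', K, points)          # (B, N, 3)
    uv = proj[..., :2] / proj[..., 2:].clamp(min=1e-6)      # (B, N, 2)

    # Normalize pixel coordinates to [-1, 1] for grid_sample.
    H, W = feat_map.shape[-2:]
    grid = torch.stack([uv[..., 0] / (W - 1),
                        uv[..., 1] / (H - 1)], dim=-1)
    grid = grid * 2 - 1                                     # (B, N, 2)

    # grid_sample expects a (B, H_out, W_out, 2) grid; use an N x 1 grid.
    sampled = F.grid_sample(feat_map, grid.unsqueeze(2),
                            mode='bilinear', align_corners=True)
    return sampled.squeeze(-1).permute(0, 2, 1)             # (B, N, C)
```

The per-point features returned here would then feed both the reconstruction and segmentation heads, which is one plausible way a shared representation can encourage cross-task consistency.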
📝 Abstract
Recent advances in generative models have achieved high fidelity in 3D human reconstruction, yet their utility for downstream tasks (e.g., 3D human-part segmentation) remains constrained. We propose HumanCrafter, a unified framework that enables the joint modeling of appearance and human-part semantics from a single image in a feed-forward manner. Specifically, we integrate human geometric priors in the reconstruction stage and self-supervised semantic priors in the segmentation stage. To address the scarcity of labeled 3D human datasets, we further develop an interactive annotation procedure for generating high-quality data-label pairs. Our pixel-aligned aggregation enables cross-task synergy, while the multi-task objective simultaneously optimizes texture-modeling fidelity and semantic consistency. Extensive experiments demonstrate that HumanCrafter surpasses existing state-of-the-art methods in both 3D human-part segmentation and 3D human reconstruction from a single image.
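As an illustration of what such a multi-task objective could look like, the sketch below combines an L1 texture-fidelity term on rendered images with a cross-entropy term on per-point part logits. The `multitask_loss` helper, the choice of loss terms, and the weights `w_tex`/`w_sem` are assumptions for exposition; the paper's actual formulation may differ.

```python
# Hedged sketch of a joint texture + part-semantics objective
# (illustrative only; weights and terms are assumptions).
import torch
import torch.nn.functional as F

def multitask_loss(pred_rgb, gt_rgb, pred_logits, gt_labels,
                   w_tex=1.0, w_sem=0.5):
    """pred_rgb / gt_rgb: (B, 3, H, W) rendered vs. ground-truth images
    pred_logits:          (B, K, N) per-point part logits over K classes
    gt_labels:            (B, N) per-point part labels
    """
    tex_loss = F.l1_loss(pred_rgb, gt_rgb)              # texture fidelity
    sem_loss = F.cross_entropy(pred_logits, gt_labels)  # part semantics
    return w_tex * tex_loss + w_sem * sem_loss
```

Optimizing both terms over a shared backbone is the standard way a multi-task objective couples appearance and semantics; the data-label pairs for the semantic term would come from the interactive annotation procedure described above.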