🤖 AI Summary
This work addresses the challenging problem of generating geometrically accurate, multi-view consistent, and editable 3D garment models from a single input image. We propose a progressive, depth-guided multi-view diffusion framework that jointly leverages single-image depth estimation, differentiable image warping, and RGB-depth co-inference. A deformation field explicitly encodes garment geometry priors, guiding a multi-view conditional diffusion model to reconstruct texture and geometry in a coordinated manner. Our key contribution is the first integration of image warping as an explicit geometric constraint in the diffusion process, enabling end-to-end generation of multi-view consistent 3D garments from a single image. Experiments demonstrate significant improvements over state-of-the-art methods in visual fidelity, structural accuracy, and cross-view consistency. Moreover, the generated models support intuitive post-hoc editing, making the approach accessible to non-expert users.
📝 Abstract
We introduce GarmentCrafter, a new approach that enables non-professional users to create and modify 3D garments from a single-view image. While recent advances in image generation have facilitated 2D garment design, creating and editing 3D garments remains challenging for non-professional users. Existing methods for single-view 3D reconstruction often rely on pre-trained generative models to synthesize novel views conditioned on the reference image and camera pose, yet they lack cross-view consistency, failing to capture the internal relationships across different views. In this paper, we tackle this challenge through progressive depth prediction and image warping to approximate novel views. Subsequently, we train a multi-view diffusion model to complete occluded and unknown clothing regions, informed by the evolving camera pose. By jointly inferring RGB and depth, GarmentCrafter enforces inter-view coherence and reconstructs precise geometries and fine details. Extensive experiments demonstrate that our method achieves superior visual fidelity and inter-view coherence compared to state-of-the-art single-view 3D garment reconstruction methods.
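The depth-and-warp step described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: it assumes a pinhole intrinsics matrix `K` and a relative pose `(R, t)` for the novel camera, forward-warps each pixel using its predicted depth, and resolves occlusions with a z-buffer. Unfilled target pixels are left at zero; in the pipeline above, such holes would be completed by the multi-view diffusion model.

```python
import numpy as np

def warp_to_novel_view(rgb, depth, K, R, t):
    """Forward-warp an RGB image into a novel view using per-pixel depth.

    Back-projects each pixel to 3D with intrinsics K, applies the relative
    camera rotation R and translation t, reprojects with K, and keeps the
    nearest point per target pixel (z-buffer). Simplified illustration only.
    """
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).astype(np.float64)

    # Back-project to camera-space 3D points: X = depth * K^{-1} [u, v, 1]^T
    rays = pix @ np.linalg.inv(K).T
    pts = rays * depth.reshape(-1, 1)

    # Transform into the novel camera frame and project with K
    pts_new = pts @ R.T + t
    proj = pts_new @ K.T
    z = proj[:, 2]
    valid = z > 1e-6
    u = np.round(proj[valid, 0] / z[valid]).astype(int)
    v = np.round(proj[valid, 1] / z[valid]).astype(int)
    colors = rgb.reshape(-1, 3)[valid]
    zv = z[valid]

    # Discard points that project outside the target image
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    u, v, colors, zv = u[inside], v[inside], colors[inside], zv[inside]

    # Z-buffer: keep the nearest point per target pixel
    out = np.zeros_like(rgb)
    zbuf = np.full((h, w), np.inf)
    for i in np.argsort(-zv):  # paint far-to-near so nearer points win
        if zv[i] < zbuf[v[i], u[i]]:
            zbuf[v[i], u[i]] = zv[i]
            out[v[i], u[i]] = colors[i]
    return out
```

A warp like this approximates the novel view only where the source image has visible geometry; regions occluded in the input remain empty, which is exactly where learned completion is needed.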