🤖 AI Summary
To address the challenge of generating high-fidelity 3D models from a single input image—exacerbated by the scarcity of 3D data in robotics—this paper proposes a two-stage NeRF-based generative framework. Its core innovation is the first introduction of mask-aware, subject-specific geometric and textural priors, explicitly embedded into the NeRF optimization pipeline to enable pixel-level alignment during both geometry reconstruction and texture refinement. Unlike generic diffusion priors—which trade off fidelity for broad generalizability—our approach significantly improves consistency between generated 3D outputs and the input image. Extensive experiments across multiple object categories demonstrate state-of-the-art performance in both geometric accuracy and texture realism. Moreover, the method substantially enhances the efficiency and diversity of 3D asset generation for robot simulation training.
📝 Abstract
In this paper, we address the critical bottleneck in robotics caused by the scarcity of diverse 3D data by presenting a novel two-stage approach for generating high-quality 3D models from a single image. This method is motivated by the need to efficiently expand 3D asset creation, particularly for robotics datasets, where the variety of object types is currently limited compared to general image datasets. Unlike previous methods that primarily rely on general diffusion priors, which often struggle to align with the reference image, our approach leverages subject-specific prior knowledge. By incorporating subject-specific priors in both geometry and texture, we ensure precise alignment between the generated 3D content and the reference object. Specifically, we introduce a shading mode-aware prior into the NeRF optimization process, enhancing the geometry and refining texture in the coarse outputs to achieve superior quality. Extensive experiments demonstrate that our method significantly outperforms prior approaches.