🤖 AI Summary
To address the bottleneck caused by 3D data scarcity and modeling complexity, this paper proposes Zero-1-to-G, the first direct image-to-3D generative model that produces 3D Gaussian Splatting (3DGS) representations by leveraging pretrained 2D diffusion models. The key move is to reframe direct 3D generation as a multi-view 2D generation task: a Gaussian splat is decomposed into multi-view images encoding its different attributes, which a 2D diffusion framework can then generate. Novel cross-view and cross-attribute attention layers capture correlations among the generated views and enforce geometric and appearance consistency, imposing 3D constraints implicitly and eliminating the need for 3D supervision or annotations. Evaluated on both synthetic and in-the-wild datasets, the method delivers efficient training, strong generalization to unseen objects, and high-fidelity 3D object generation from standard 2D image inputs.
📝 Abstract
Recent advances in 2D image generation have achieved remarkable quality, largely driven by the capacity of diffusion models and the availability of large-scale datasets. However, direct 3D generation is still constrained by the scarcity and lower fidelity of 3D datasets. In this paper, we introduce Zero-1-to-G, a novel approach that addresses this problem by enabling direct single-view generation on Gaussian splats using pretrained 2D diffusion models. Our key insight is that Gaussian splats, a 3D representation, can be decomposed into multi-view images encoding different attributes. This reframes the challenging task of direct 3D generation within a 2D diffusion framework, allowing us to leverage the rich priors of pretrained 2D diffusion models. To incorporate 3D awareness, we introduce cross-view and cross-attribute attention layers, which capture complex correlations and enforce 3D consistency across generated splats. This makes Zero-1-to-G the first direct image-to-3D generative model to effectively utilize pretrained 2D diffusion priors, enabling efficient training and improved generalization to unseen objects. Extensive experiments on both synthetic and in-the-wild datasets demonstrate superior performance in 3D object generation, offering a new approach to high-quality 3D generation.
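The paper does not include code here, but the core reshaping idea behind cross-view and cross-attribute attention can be illustrated with a minimal NumPy sketch. This is a hypothetical simplification, not the paper's implementation: it uses parameter-free scaled dot-product self-attention, and the tensor layout `(views, attributes, tokens, channels)` is an assumption for illustration. Cross-view attention lets tokens from all views of one attribute map attend jointly; cross-attribute attention lets all attribute maps of one view attend jointly.

```python
import numpy as np

def self_attention(x):
    """Parameter-free scaled dot-product self-attention over the token axis."""
    d = x.shape[-1]
    scores = x @ x.swapaxes(-1, -2) / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # softmax numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x

# Hypothetical sizes: V views, A attribute maps (e.g. color, position,
# opacity), N tokens per map, D channels per token.
V, A, N, D = 4, 3, 16, 8
x = np.random.default_rng(0).normal(size=(V, A, N, D))

# Cross-view attention: for each attribute, merge the view and token axes
# so tokens from all V views form one joint sequence, then restore shape.
xv = x.transpose(1, 0, 2, 3).reshape(A, V * N, D)
out_view = self_attention(xv).reshape(A, V, N, D).transpose(1, 0, 2, 3)

# Cross-attribute attention: for each view, merge the attribute and token
# axes so all A attribute maps form one joint sequence.
xa = x.reshape(V, A * N, D)
out_attr = self_attention(xa).reshape(V, A, N, D)

assert out_view.shape == x.shape and out_attr.shape == x.shape
```

In a real diffusion U-Net these operations would use learned query/key/value projections inside the denoising network; the sketch only shows how reshaping turns per-image attention into attention across views and across attributes.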