AI Summary
This work addresses single-image 3D generation under sparse supervision, where existing methods often fail to generalize across diverse semantic categories and complex structures and produce fragmented or incomplete outputs. The authors propose an adaptive part-whole hierarchical model that infers soft composition masks from image tokens and dynamically identifies and merges redundant structural slots, yielding compact yet expressive 3D representations. The key components are a category-agnostic learnable bank of geometric prototypes, an adaptive slot-gating mechanism, and a lightweight 3D diffusion denoiser. Together they allow shape priors to be shared across categories and let the structure adapt without a predefined number of parts. Experiments show consistent gains over state-of-the-art methods in cross-category transfer and part-count extrapolation, supporting the model's effectiveness and generalization capability.
Abstract
Single-image 3D generation is a core problem for real-world vision-to-graphics models. However, achieving reliable generalization across diverse semantic categories and highly variable structural complexity under sparse supervision remains a fundamental challenge. Existing approaches typically model objects monolithically or rely on a fixed number of parts; even recent part-aware models such as PartCrafter still require the user to specify the part count, which is labor-intensive. Such designs are prone to overfitting, fragmented or missing structural components, and limited compositional generalization on novel object layouts. To this end, this paper reframes single-image 3D generation as learning an adaptive part-whole hierarchy in a flexible 3D latent space. We present a part-to-whole 3D generative world model that autonomously discovers latent structural slots by inferring soft, compositional masks directly from image tokens. Specifically, an adaptive slot-gating mechanism dynamically determines slot-wise activation probabilities and smoothly consolidates redundant slots within different objects, so that the emergent structure remains compact yet expressive across categories. Each distilled slot is then aligned to a learnable, class-agnostic prototype bank, enabling cross-category shape sharing and denoising through universal geometric prototypes. Furthermore, a lightweight 3D denoiser reconstructs geometry and appearance via unified diffusion objectives. Experiments show consistent gains in cross-category transfer and part-count extrapolation, and ablations confirm the complementary benefits of the prototype bank for shape-prior sharing and of slot-gating for structural adaptation.
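The slot-gating and prototype-alignment steps described above can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`gate_and_merge`, `assign_prototypes`), the thresholds, the temperature, and the use of cosine similarity are all assumptions made for illustration, and the hard thresholds here stand in for the smooth, learned consolidation that the abstract describes.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-8)

def gate_and_merge(slots, gate_logits, keep_thresh=0.5, merge_thresh=0.9):
    """Hypothetical hard-threshold stand-in for adaptive slot gating.

    Drops slots whose activation probability is below keep_thresh, then
    folds each remaining slot into an already-kept slot when their cosine
    similarity is at least merge_thresh (probability-weighted average).
    """
    kept = []  # list of (vector, accumulated gate probability)
    for vec, logit in zip(slots, gate_logits):
        p = sigmoid(logit)
        if p < keep_thresh:
            continue  # inactive slot
        for i, (kvec, kw) in enumerate(kept):
            if cosine(vec, kvec) >= merge_thresh:
                total = kw + p
                merged = [(kw * a + p * b) / total for a, b in zip(kvec, vec)]
                kept[i] = (merged, total)
                break
        else:
            kept.append((list(vec), p))  # no near-duplicate found
    return [v for v, _ in kept]

def assign_prototypes(slots, prototypes, temperature=0.1):
    """Soft-assign each slot to a prototype bank (assumed: softmax over cosine)."""
    weights = []
    for s in slots:
        sims = [cosine(s, p) / temperature for p in prototypes]
        m = max(sims)  # subtract the max for numerical stability
        exps = [math.exp(x - m) for x in sims]
        z = sum(exps)
        weights.append([e / z for e in exps])
    return weights

# Toy example: four candidate slots, one near-duplicate and one inactive.
slots = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.5, 0.5]]
gate_logits = [2.0, 1.5, 1.0, -3.0]
active = gate_and_merge(slots, gate_logits)
weights = assign_prototypes(active, prototypes=[[1.0, 0.0], [0.0, 1.0]])
```

In this toy run the number of active slots is decided by the gate values and the redundancy among slots rather than fixed in advance, which is the property the abstract attributes to slot gating. In the paper's model these quantities would be learned end to end instead of thresholded.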