🤖 AI Summary
To address incomplete and geometrically inconsistent 3D object reconstruction from sparse, pose-free, and partially occluded multi-view inputs, this paper proposes a generative 3D reconstruction framework. The method jointly models visible and invisible regions through two novel attention mechanisms: view-wise cross-attention and stereo-conditioned cross-attention. These modules tightly integrate 2D amodal completion priors with multi-view stereo geometric constraints, enabling geometrically plausible and appearance-consistent inference of unobserved structures. Evaluated on both synthetic and real-world datasets, the approach significantly improves reconstruction completeness and fidelity over conventional multi-view reconstruction and single-view inpainting methods, achieving state-of-the-art performance in recovering occluded geometry while preserving structural coherence and visual consistency across views. This advances object-level 3D understanding for applications such as robotic grasping and AR/VR.
📝 Abstract
Reconstructing 3D objects from a few unposed and partially occluded views is a common yet challenging problem in real-world scenarios, where many object surfaces are never directly observed. Traditional multi-view or inpainting-based approaches struggle under such conditions, often yielding incomplete or geometrically inconsistent reconstructions. We introduce AmodalGen3D, a generative framework for amodal 3D object reconstruction that infers complete, occlusion-free geometry and appearance from arbitrary sparse inputs. The model integrates 2D amodal completion priors with multi-view stereo geometry conditioning, supported by a View-Wise Cross Attention mechanism for sparse-view feature fusion and a Stereo-Conditioned Cross Attention module for unobserved structure inference. By jointly modeling visible and hidden regions, AmodalGen3D faithfully reconstructs 3D objects that are consistent with sparse-view constraints while plausibly hallucinating unseen parts. Experiments on both synthetic and real-world datasets demonstrate that AmodalGen3D achieves superior fidelity and completeness under occlusion-heavy sparse-view settings, addressing a pressing need for object-level 3D scene reconstruction in robotics, AR/VR, and embodied AI applications.
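To make the two conditioning pathways concrete, here is a minimal NumPy sketch of how view-wise and stereo-conditioned cross-attention could compose. This is an illustrative assumption, not the paper's implementation: single-head scaled dot-product attention is used, and fusing per-view outputs by averaging is a placeholder design choice; all names (`cross_attention`, `latent`, `stereo`) are hypothetical.

```python
import numpy as np

def cross_attention(q, k, v):
    """Single-head scaled dot-product cross-attention.
    q: (Nq, d) query tokens; k, v: (Nk, d) key/value tokens."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
d = 16
latent = rng.standard_normal((32, d))                      # 3D latent tokens (hypothetical)
views = [rng.standard_normal((64, d)) for _ in range(3)]   # per-view 2D features, 3 sparse views
stereo = rng.standard_normal((48, d))                      # multi-view stereo geometry tokens

# View-wise cross-attention (assumed form): attend to each sparse view
# separately, then average, so no single view dominates the fusion.
view_out = np.mean([cross_attention(latent, f, f) for f in views], axis=0)

# Stereo-conditioned cross-attention (assumed form): condition the updated
# latents on stereo geometry tokens to constrain unobserved structure.
out = cross_attention(latent + view_out, stereo, stereo)
print(out.shape)  # (32, 16)
```

In this sketch the latent tokens stay fixed in number, so the pipeline works for any count of input views, matching the paper's arbitrary sparse-input setting.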