🤖 AI Summary
Generating a multi-object 3D scene from a single image is difficult because of heavy occlusion and the geometric distortion that arises when coupled objects are reconstructed together. To address this, we propose a two-stage framework. First, off-the-shelf image-to-3D models reconstruct a mesh for each object independently. Second, the global scene layout is jointly optimized via differentiable rendering, constrained by a novel optimal transport-driven long-range appearance loss together with a high-level semantic loss. This combination models object-level geometric independence and scene-level structural consistency in a single pipeline. The approach integrates differentiable rendering, optimal transport theory, and semantics-guided gradient optimization. Evaluated on multi-object benchmarks, our method significantly improves geometric detail fidelity, object separation, and global coherence, outperforming state-of-the-art single-image 3D generation methods both quantitatively and qualitatively.
📝 Abstract
Traditional image-to-3D models often struggle with scenes containing multiple objects, owing to biases and complex occlusions. To address this challenge, we present REPARO, a novel approach for compositional 3D asset generation from single images. REPARO employs a two-step process: first, it extracts individual objects from the scene and reconstructs their 3D meshes using off-the-shelf image-to-3D models; then, it optimizes the layout of these meshes through differentiable rendering, ensuring coherent scene composition. By integrating an optimal transport-based long-range appearance loss term and a high-level semantic loss term into the differentiable rendering, REPARO can effectively recover the layout of 3D assets. The proposed method can significantly enhance object independence, detail accuracy, and overall scene coherence. Extensive evaluation on multi-object scenes demonstrates that REPARO offers a comprehensive approach to addressing the complexities of multi-object 3D scene generation from single images.
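To give a feel for what an optimal transport-based appearance loss can look like, the following is a minimal sketch, not REPARO's implementation. The abstract does not specify the loss in detail, so everything here is an assumption: the function name `sinkhorn_ot_loss`, the use of entropic (Sinkhorn) regularization, the uniform pixel weights, and the hypothetical per-pixel feature layout `(x, y, r, g, b)` scaled to `[0, 1]`. It uses plain NumPy for clarity; an actual layout optimizer would need an autodiff framework so that gradients flow back to the object poses.

```python
import numpy as np

def sinkhorn_ot_loss(x, y, eps=0.1, n_iters=200):
    """Entropic-regularized OT cost between two feature sets (illustrative only).

    x: (n, d) features sampled from the rendered image (assumed layout: x, y, r, g, b)
    y: (m, d) features sampled from the reference image
    Returns (transport cost, transport plan). Uniform weights are an assumption.
    """
    n, m = x.shape[0], y.shape[0]
    # Pairwise squared Euclidean cost between every rendered and reference sample.
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    a = np.full(n, 1.0 / n)  # uniform mass on rendered samples
    b = np.full(m, 1.0 / m)  # uniform mass on reference samples
    K = np.exp(-cost / eps)  # Gibbs kernel; assumes features scaled to about [0, 1]
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        # Alternate scaling so the plan's marginals approach a and b.
        u = a / (K @ v)
        v = b / (K.T @ u)
    plan = u[:, None] * K * v[None, :]
    return float((plan * cost).sum()), plan
```

Because the cost includes pixel position as well as color, a misplaced object still produces a graded penalty that grows with the displacement, even when the rendered and reference objects do not overlap. This is the intuition behind calling such a loss "long-range", in contrast to a per-pixel difference that carries no positional signal once the objects are disjoint.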