🤖 AI Summary
Existing layout-aware 3D generation methods typically support only single-view inputs, limiting their ability to leverage complementary multi-view information and often producing physically implausible layouts, such as interpenetrating or floating objects, due to independently estimated object poses. To address these limitations, this work proposes a training-free multi-view fusion framework that enforces view consistency in 3D latent space through multiple diffusion processes. It introduces a confidence-aware adaptive fusion mechanism based on attention entropy and visibility weighting, and jointly optimizes collision and contact constraints both during and after generation to enhance physical plausibility. Experiments demonstrate that the proposed method significantly improves reconstruction fidelity and layout plausibility on standard benchmarks and real-world multi-object scenes, all without requiring additional training.
📝 Abstract
Recent unified 3D generation models have made remarkable progress in producing high-quality 3D assets from a single image. Notably, layout-aware approaches such as SAM3D can reconstruct multiple objects while preserving their spatial arrangement, opening the door to practical scene-level 3D generation. However, current methods are limited to single-view input and cannot leverage complementary multi-view observations, while independently estimated object poses often lead to physically implausible layouts such as interpenetration and floating artifacts.
We present MV-SAM3D, a training-free framework that extends layout-aware 3D generation with multi-view consistency and physical plausibility. We formulate multi-view fusion as a Multi-Diffusion process in 3D latent space and propose two adaptive weighting strategies -- attention-entropy weighting and visibility weighting -- that enable confidence-aware fusion, ensuring each viewpoint contributes according to its local observation reliability. For multi-object composition, we introduce physics-aware optimization that injects collision and contact constraints both during and after generation, yielding physically plausible object arrangements. Experiments on standard benchmarks and real-world multi-object scenes demonstrate significant improvements in reconstruction fidelity and layout plausibility, all without any additional training. Code is available at https://github.com/devinli123/MV-SAM3D.
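The confidence-aware fusion described above can be illustrated with a minimal sketch. The paper does not publish this formula in the abstract, so everything below is a plausible interpretation: per-view latents are blended with weights that favor views with low attention entropy (a sharper, more confident attention distribution) and high visibility of the region in question. The function name, the softmax form, and the `beta` temperature are all assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def fuse_view_latents(latents, attn_entropies, visibilities, beta=1.0):
    """Hypothetical confidence-aware fusion of per-view 3D latents.

    latents:        (V, D) array, one latent vector per view.
    attn_entropies: (V,) attention entropy per view; lower = more confident.
    visibilities:   (V,) visibility weight per view in [0, 1].
    beta:           temperature controlling how sharply entropy is penalized.
    """
    latents = np.asarray(latents, dtype=float)
    # Score each view: penalize attention entropy, reward visibility.
    scores = -beta * np.asarray(attn_entropies, dtype=float)
    scores = scores + np.log(np.asarray(visibilities, dtype=float) + 1e-8)
    # Softmax over views (shifted for numerical stability) gives fusion weights.
    w = np.exp(scores - scores.max())
    w = w / w.sum()
    # Weighted average of the per-view latents.
    return (w[:, None] * latents).sum(axis=0)
```

Under this sketch, two views with identical entropy and visibility contribute equally, while a view with sharper attention (lower entropy) or better visibility pulls the fused latent toward its own prediction.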