🤖 AI Summary
Single-image multi-view synthesis often suffers from spatial inconsistency, degrading downstream 3D reconstruction quality. To address this, we propose a diffusion-based multi-view generation framework centered on a novel “latent-space weaving mechanism”: orthogonal plane projections align multi-view features, enabling aggregation and interpolation of view-specific encodings within a shared latent space for implicit cross-view scene modeling and collaborative reasoning. Our method enables fast, geometrically consistent novel-view synthesis—generating 16 high-fidelity, geometry-aligned views in just 15 seconds. Quantitatively, it surpasses state-of-the-art methods across image fidelity metrics (FID, LPIPS) and 3D reconstruction benchmarks (Chamfer distance, mIoU). Notably, it significantly improves single-image-driven neural radiance field (NeRF) and mesh reconstruction performance, demonstrating superior implicit 3D consistency and generalization.
📝 Abstract
Generating consistent multi-view images from a single image remains challenging: the lack of spatial consistency across generated views often degrades 3D mesh quality in downstream surface reconstruction. To address this, we propose LoomNet, a novel multi-view diffusion architecture that produces coherent images by applying the same diffusion model multiple times in parallel to collaboratively build and leverage a shared latent space for view consistency. Each viewpoint-specific inference generates an encoding representing its own hypothesis of the novel view from a given camera pose, which is projected onto three orthogonal planes. For each plane, the encodings from all views are fused into a single aggregated plane. These aggregated planes are then processed to propagate information and interpolate missing regions, combining the per-view hypotheses into a unified, coherent interpretation. The final latent space is then used to render consistent multi-view images. LoomNet generates 16 high-quality, coherent views in just 15 seconds. In our experiments, LoomNet outperforms state-of-the-art methods on both image quality and reconstruction metrics, while also exhibiting creativity by producing diverse, plausible novel views from the same input.
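The aggregation step described above — fusing per-view plane encodings and then interpolating regions no view covers — can be sketched as follows. This is a minimal NumPy illustration under assumed shapes: plane encodings as a `(V, C, H, W)` array with per-view visibility masks, and the paper's learned propagation/interpolation module replaced by simple iterative neighbour averaging. None of these names or shapes come from the paper itself.

```python
import numpy as np

def fuse_planes(planes, masks, eps=1e-8):
    """Fuse per-view encodings of one orthogonal plane into a single plane.

    planes: (V, C, H, W) view-specific encodings (hypothetical layout).
    masks:  (V, 1, H, W) visibility weights in [0, 1]; 0 marks regions a
            view contributes nothing to.
    Returns the visibility-weighted mean plane (C, H, W) and a coverage
    map (1, H, W) telling which regions still need interpolation.
    """
    weighted = (planes * masks).sum(axis=0)       # (C, H, W)
    coverage = masks.sum(axis=0)                  # (1, H, W)
    fused = weighted / np.maximum(coverage, eps)  # uncovered cells stay 0
    return fused, coverage

def interpolate_missing(fused, coverage, iters=8):
    """Propagate information into uncovered regions by repeated
    4-neighbour averaging (a crude stand-in for the learned step)."""
    filled = fused.copy()
    known = (coverage > 0).astype(fused.dtype)    # (1, H, W)
    for _ in range(iters):
        pad = np.pad(filled, ((0, 0), (1, 1), (1, 1)))
        kpad = np.pad(known, ((0, 0), (1, 1), (1, 1)))
        # Sum of the four neighbours; unknown cells contribute 0.
        nsum = (pad[:, :-2, 1:-1] + pad[:, 2:, 1:-1]
                + pad[:, 1:-1, :-2] + pad[:, 1:-1, 2:])
        ksum = (kpad[:, :-2, 1:-1] + kpad[:, 2:, 1:-1]
                + kpad[:, 1:-1, :-2] + kpad[:, 1:-1, 2:])
        avg = nsum / np.maximum(ksum, 1e-8)
        grow = (known == 0) & (ksum > 0)          # frontier of unknowns
        filled = np.where(grow, avg, filled)
        known = np.where(grow, 1.0, known)
    return filled
```

For example, if one view only covers the left edge of a plane and another only the right edge, `fuse_planes` keeps each view's encoding where it is visible and `interpolate_missing` grows those values inward until the plane is dense — a rough analogue of combining view hypotheses into one coherent latent.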