🤖 AI Summary
Single-view 3D reconstruction is hampered by the multi-view inconsistency of diffusion-generated views, and large reconstruction models (LRMs) tend to amplify this geometric incoherence during reconstruction. To address this, we propose CDI3D, an end-to-end feed-forward framework whose Dense View Interpolation (DVI) module synthesizes geometrically consistent dense views between diffusion-generated main views along a tilted camera trajectory. For the first time, the interpolated views are fed jointly with the original input views into a tri-plane-based reconstruction network, which encodes them into tri-plane features and decodes an implicit grid into a mesh. Our method achieves state-of-the-art performance across multiple benchmarks: it significantly improves the geometric accuracy and texture fidelity of the reconstructed 3D meshes while remaining far more efficient at inference than iterative optimization approaches, effectively balancing fidelity and computational cost.
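Conceptually, the feed-forward flow can be sketched as below. This is a minimal illustration, not the authors' released API: the three callables `multiview_diffusion`, `dense_view_interpolator`, and `triplane_lrm` are hypothetical placeholders for the paper's components.

```python
from typing import Callable

import torch
from torch import Tensor


def cdi3d_pipeline(
    image: Tensor,
    multiview_diffusion: Callable[[Tensor], Tensor],
    dense_view_interpolator: Callable[[Tensor], Tensor],
    triplane_lrm: Callable[[Tensor], object],
) -> object:
    """Sketch of the CDI3D feed-forward pipeline (placeholder callables)."""
    # 1) A 2D multi-view diffusion model lifts the single input image
    #    to a sparse set of main views, e.g. shape (N_main, C, H, W).
    main_views = multiview_diffusion(image)

    # 2) The DVI module densifies the inputs by synthesizing consistent
    #    interpolated views between adjacent main views.
    interp_views = dense_view_interpolator(main_views)

    # 3) Original and interpolated views are jointly encoded into
    #    tri-plane features; a mesh is decoded from the implicit field.
    all_views = torch.cat([main_views, interp_views], dim=0)
    return triplane_lrm(all_views)
```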
📝 Abstract
3D object reconstruction from a single-view image is a fundamental task in computer vision with wide-ranging applications. Recent advances in Large Reconstruction Models (LRMs) have shown great promise in leveraging multi-view images generated by 2D diffusion models to extract 3D content. However, challenges remain: 2D diffusion models often struggle to produce dense images with strong multi-view consistency, and LRMs tend to amplify these inconsistencies during 3D reconstruction. Addressing these issues is critical for achieving high-quality and efficient 3D reconstruction. In this paper, we present CDI3D, a feed-forward framework designed for efficient, high-quality image-to-3D generation with view interpolation. To tackle these challenges, we propose to integrate 2D diffusion-based view interpolation into the LRM pipeline to enhance the quality and consistency of the generated mesh. Specifically, our approach introduces a Dense View Interpolation (DVI) module, which synthesizes interpolated images between the main views generated by the 2D diffusion model, effectively densifying the input views with better multi-view consistency. We also design a tilted camera pose trajectory to capture views at different elevations and perspectives. We then employ a tri-plane-based mesh reconstruction strategy to extract robust tokens from the interpolated and original views, enabling the generation of high-quality 3D meshes with superior texture and geometry. Extensive experiments demonstrate that our method significantly outperforms previous state-of-the-art approaches across various benchmarks, producing 3D content with enhanced texture fidelity and geometric accuracy.
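To make the tilted camera pose trajectory concrete, the sketch below shows one plausible parameterization: azimuth and elevation are linearly interpolated between two adjacent main views, with a sinusoidal elevation offset so the interpolated cameras sweep above and below the main-view orbit. The abstract does not specify the exact parameterization, so the function `tilted_trajectory` and its `tilt_deg` parameter are illustrative assumptions.

```python
import numpy as np


def tilted_trajectory(azim_a, azim_b, elev_a, elev_b, n_interp, tilt_deg=10.0):
    """Camera poses (azimuth, elevation) between two adjacent main views.

    Illustrative assumption, not the paper's exact scheme: linear
    interpolation of azimuth/elevation plus a sinusoidal elevation
    offset of amplitude `tilt_deg`, so interpolated views cover
    different elevations and perspectives. Angles are in degrees.
    """
    # Interior sample positions, excluding the two main-view endpoints.
    t = np.linspace(0.0, 1.0, n_interp + 2)[1:-1]
    azim = (1.0 - t) * azim_a + t * azim_b
    elev = (1.0 - t) * elev_a + t * elev_b + tilt_deg * np.sin(np.pi * t)
    return np.stack([azim, elev], axis=-1)  # shape (n_interp, 2)


# Example: three interpolated poses between main views at azimuth 0° and 60°.
poses = tilted_trajectory(0.0, 60.0, 20.0, 20.0, n_interp=3)
```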