🤖 AI Summary
Existing single-image 3D generation methods struggle to maintain multi-view geometric and textural consistency under large pose variations, which degrades reconstruction quality. To address this, the authors propose AR-1-to-3, a diffusion-based progressive next-view prediction paradigm: starting from the input image, it autoregressively synthesizes a sequence of neighboring views at increasing angular distances, each conditioned on the views already generated. Two complementary encoders supply this conditioning: a stacked local feature encoding (Stacked-LE) that captures fine-grained local detail, and an LSTM-based global feature encoding (LSTM-GE) that models long-range dependencies across the view sequence. In addition, explicit multi-view consistency constraints imposed during diffusion training regularize geometric and appearance coherence. Evaluated on Objaverse and ShapeNet, the method reports a 32% reduction in 3D FID and a 4.8 dB PSNR improvement over prior art, significantly enhancing the geometric fidelity, textural consistency, and renderability of the generated 3D assets.
📝 Abstract
Novel view synthesis (NVS) is a cornerstone of image-to-3D creation. However, existing works still struggle to maintain consistency between the generated views and the input view, especially when the camera pose difference is large, leading to poor-quality 3D geometries and textures. We attribute this issue to their treating all target views with equal priority, motivated by our empirical observation that target views closer to the input view exhibit higher fidelity. Based on this insight, we propose AR-1-to-3, a novel next-view prediction paradigm built on diffusion models: it first generates views close to the input view, then uses them as contextual information to progressively synthesize farther views. To encode the generated view subsequences as local and global conditions for next-view prediction, we develop a stacked local feature encoding strategy (Stacked-LE) and an LSTM-based global feature encoding strategy (LSTM-GE). Extensive experiments demonstrate that our method significantly improves the consistency between the generated views and the input view, producing high-fidelity 3D assets.
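The progressive next-view loop described above can be sketched in a few lines. This is a minimal toy illustration, not the paper's implementation: views are stand-in feature vectors, `denoise_step` is a hypothetical stub for the conditional diffusion sampler, and `lstm_ge` replaces a real LSTM with a simple tanh recurrence; only the control flow (closest views first, earlier outputs conditioning later ones via local and global encodings) reflects the described method.

```python
import numpy as np

def stacked_le(views, k=2):
    # Stacked-LE sketch: concatenate the k most recent views
    # as fine-grained local conditioning.
    return np.concatenate(views[-k:])

def lstm_ge(views, dim=8):
    # LSTM-GE sketch: fold the whole generated subsequence into one
    # global state (a tanh recurrence standing in for a real LSTM).
    rng = np.random.default_rng(0)
    W = rng.standard_normal((dim, dim)) * 0.1
    U = rng.standard_normal((dim, dim)) * 0.1
    h = np.zeros(dim)
    for v in views:
        h = np.tanh(W @ h + U @ v)
    return h

def denoise_step(noise, local_cond, global_cond):
    # Hypothetical stand-in for one conditional diffusion sampling pass.
    return np.tanh(noise + 0.1 * global_cond + 0.1 * local_cond.mean())

def ar_next_view(input_view, num_views=4, dim=8, seed=1):
    # Autoregressive next-view prediction: each new (farther) view is
    # conditioned on all views synthesized so far.
    rng = np.random.default_rng(seed)
    views = [input_view]
    for _ in range(num_views):
        local = stacked_le(views)
        glob = lstm_ge(views, dim)
        views.append(denoise_step(rng.standard_normal(dim), local, glob))
    return views
```

Running `ar_next_view(np.zeros(8))` yields the input view plus four progressively farther synthesized views, each produced with conditioning derived from its predecessors.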