🤖 AI Summary
Single-image 3D reconstruction suffers from cross-view inconsistency (CVC): multi-view images synthesized from a single input exhibit geometric and appearance discrepancies across viewpoints, severely degrading reconstruction fidelity. To address this, we propose AlignCVC, a framework that enforces distribution-level alignment between generated and reconstructed views through a soft-hard dual alignment strategy, moving beyond conventional regression-based losses. AlignCVC supports end-to-end optimization and plug-and-play integration with diverse multi-view generation and 3D reconstruction backbones. Experiments demonstrate state-of-the-art reconstruction accuracy and visual consistency, while accelerating inference to as few as four steps and generalizing well across architectures.
📝 Abstract
Single-image-to-3D models typically follow a sequential generation and reconstruction workflow. However, intermediate multi-view images synthesized by pre-trained generation models often lack cross-view consistency (CVC), significantly degrading 3D reconstruction performance. While recent methods attempt to refine CVC by feeding reconstruction results back into the multi-view generator, these approaches struggle with noisy and unstable reconstruction outputs, which limits effective CVC improvement. We introduce AlignCVC, a novel framework that fundamentally reframes single-image-to-3D generation through distribution alignment rather than strict regression losses. Our key insight is to align both the generated and the reconstructed multi-view distributions toward the ground-truth multi-view distribution, establishing a principled foundation for improved CVC. Observing that generated images exhibit weak CVC while reconstructed images display strong CVC due to explicit rendering, we propose a soft-hard alignment strategy with distinct objectives for the generation and reconstruction models. This approach not only enhances generation quality but also dramatically accelerates inference to as few as 4 steps. As a plug-and-play paradigm, AlignCVC seamlessly integrates various multi-view generation models with 3D reconstruction models. Extensive experiments demonstrate the effectiveness and efficiency of AlignCVC for single-image-to-3D generation.
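The abstract contrasts a soft, distribution-level objective for the weakly consistent generator with a hard, regression-style objective for the strongly consistent (rendered) reconstruction. The paper's exact losses are not given here, so the following is a minimal illustrative sketch on toy arrays: hard alignment as per-pixel MSE, and soft alignment approximated by matching per-view moment statistics (both the data shapes and the moment-matching proxy are assumptions, not the authors' implementation).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: 4 "views", each an 8x8 single-channel image.
generated = rng.normal(0.0, 1.0, size=(4, 8, 8))      # generator output: weak CVC
reconstructed = rng.normal(0.2, 0.7, size=(4, 8, 8))  # rendered output: strong CVC
ground_truth = rng.normal(0.2, 0.6, size=(4, 8, 8))   # target multi-view set

def hard_alignment_loss(pred, target):
    """Strict per-pixel regression (MSE): reasonable when the prediction is
    already geometrically consistent, as with explicitly rendered views."""
    return float(np.mean((pred - target) ** 2))

def soft_alignment_loss(pred, target):
    """Distribution-level proxy (an illustrative assumption): match first and
    second moments of per-view pixel statistics instead of exact pixel values,
    so the generator is not over-penalized for pixel-level mismatch."""
    mu_p, mu_t = pred.mean(axis=(1, 2)), target.mean(axis=(1, 2))
    sd_p, sd_t = pred.std(axis=(1, 2)), target.std(axis=(1, 2))
    return float(np.mean((mu_p - mu_t) ** 2 + (sd_p - sd_t) ** 2))

loss_gen = soft_alignment_loss(generated, ground_truth)      # soft path: generator
loss_rec = hard_alignment_loss(reconstructed, ground_truth)  # hard path: reconstructor
print(loss_gen, loss_rec)
```

The design intuition mirrors the abstract: the hard loss demands exact agreement and suits outputs that are already cross-view consistent, while the soft loss only pulls the generated views toward the target distribution.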