🤖 AI Summary
This work addresses insufficient geometric consistency and reconstruction fidelity in novel view synthesis from sparse observations without camera pose information. The authors propose ReNoV, a framework that leverages the geometric and semantic correspondences embedded in the spatial attention of external visual representations. Through dedicated representation projection modules, ReNoV injects these representations as conditioning signals into a diffusion-based generative process, enabling high-quality view synthesis from unposed images. On standard benchmarks, ReNoV outperforms prior diffusion-based novel view synthesis methods, with marked improvements in reconstruction fidelity and inpainting quality and enhanced geometric consistency.
📝 Abstract
We propose a novel framework for diffusion-based novel view synthesis that conditions generation on external representations, harnessing their geometric and semantic correspondence properties to improve the geometric consistency of generated viewpoints. First, we provide a detailed analysis of the correspondence capabilities that emerge in the spatial attention of external visual representations. Building on these insights, we introduce ReNoV (representation-guided novel view synthesis), which uses dedicated representation projection modules to inject external representations into the diffusion process. Our experiments show that this design yields marked improvements in both reconstruction fidelity and inpainting quality, outperforming prior diffusion-based novel view synthesis methods on standard benchmarks and enabling robust synthesis from sparse, unposed image collections.
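To make the notion of correspondence in spatial attention concrete, here is a minimal sketch, not the authors' implementation. It assumes each view is described by a list of per-patch feature vectors from some external visual encoder, and it reads correspondences off a scaled dot-product attention map between the two views. The function names `cross_view_attention` and `hard_correspondences` and the toy features are hypothetical, chosen only for illustration.

```python
import math

def cross_view_attention(feats_a, feats_b):
    """Row-wise softmax of scaled dot products between two sets of patch features.

    feats_a: list of N feature vectors (patches of view A)
    feats_b: list of M feature vectors (patches of view B)
    Returns an N x M matrix whose rows each sum to 1.
    """
    dim = len(feats_a[0])
    scale = 1.0 / math.sqrt(dim)
    attn = []
    for qa in feats_a:
        logits = [scale * sum(x * y for x, y in zip(qa, kb)) for kb in feats_b]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(value - peak) for value in logits]
        total = sum(exps)
        attn.append([e / total for e in exps])
    return attn

def hard_correspondences(attn):
    """For each patch in view A, return the index of its most-attended patch in view B."""
    return [max(range(len(row)), key=row.__getitem__) for row in attn]

# Toy features (an assumption for illustration): view B contains the same
# three patches as view A, but in a shuffled order.
view_a = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0]]
view_b = [view_a[2], view_a[0], view_a[1]]

attn = cross_view_attention(view_a, view_b)
matches = hard_correspondences(attn)
print(matches)  # [1, 2, 0]
```

In this toy case the attention map recovers the shuffle exactly. With real encoder features the map would be soft and noisy, which is one reason a method might pass the representations to the generator as conditioning rather than committing to hard matches.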