🤖 AI Summary
Single-image novel view synthesis requires inpainting occluded regions, extrapolating 3D structure, and preserving geometric consistency, all at once. Existing approaches rely on multi-view fine-tuning or on training a diffusion model from scratch, incurring high computational cost and often yielding blurry outputs that generalize poorly. This paper proposes a lightweight, explicit view-transformation framework. First, the source image's latent representation is obtained via DDIM inversion. Then, conditioned on the camera pose, a compact translation U-Net (TUNet) predicts the target-view latent. Crucially, a latent-space fusion strategy exploits the noise correlation structure of DDIM inversion to preserve fine-grained texture, while the prior of a pre-trained diffusion model is harnessed for high-fidelity generation. The method requires no fine-tuning of large diffusion models, and on MVImgNet it outperforms state-of-the-art methods in both visual sharpness and geometric consistency.
📝 Abstract
Synthesizing novel views from a single input image is challenging: it requires extrapolating the 3D structure of the scene, inferring details in occluded regions, and maintaining geometric consistency across viewpoints. Many existing methods fine-tune large diffusion backbones on multiple views or train a diffusion model from scratch, which is extremely expensive, and they often suffer from blurry reconstructions and poor generalization. This gap motivates an explicit, lightweight view-translation framework that directly exploits the high-fidelity generative capability of a pretrained diffusion model while reconstructing the scene from a novel view. Given the DDIM-inverted latent of a single input image, we employ a camera-pose-conditioned translation U-Net, TUNet, to predict the inverted latent corresponding to the desired target view. However, an image sampled directly from the predicted latent can still be blurry. To this end, we propose a novel fusion strategy that exploits the noise correlation structure inherent in DDIM inversion, which helps preserve texture and fine-grained detail. To synthesize the novel view, we use the fused latent as the initial condition for DDIM sampling, leveraging the generative prior of the pretrained diffusion model. Extensive experiments on MVImgNet demonstrate that our method outperforms existing methods.
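The pipeline above can be sketched with the standard deterministic DDIM (η = 0) update, which is used both to invert the source image into a latent and to sample the novel view from the fused latent. This is a toy NumPy sketch, not the paper's implementation: the noise schedule, latent shape, and frozen `eps_hat` predictor are stand-ins, the TUNet is omitted, and `fuse_latents` is a hypothetical convex blend substituting for the paper's correlation-based fusion, whose exact form is not given here. With a fixed noise estimate, each DDIM step is algebraically invertible, so inverting and then sampling recovers the input latent, which is the property the method relies on.

```python
import numpy as np

def ddim_step(x, eps, a_from, a_to):
    """One deterministic DDIM step between noise levels a_from -> a_to.

    Standard DDIM (eta = 0) update: estimate x0 from the current latent,
    then re-noise that estimate to the target level with the same eps.
    Works in both directions (inversion: a_to < a_from; sampling: a_to > a_from).
    """
    x0_pred = (x - np.sqrt(1.0 - a_from) * eps) / np.sqrt(a_from)
    return np.sqrt(a_to) * x0_pred + np.sqrt(1.0 - a_to) * eps

def fuse_latents(z_tunet, z_source, w=0.7):
    """Hypothetical latent fusion (convex blend).

    The paper fuses latents using DDIM noise correlations; the exact form is
    not reproduced here, so a simple weighted blend stands in for it.
    """
    return w * z_tunet + (1.0 - w) * z_source

# Toy demo: invert a "latent" to high noise, then sample it back.
rng = np.random.default_rng(0)
alphas = np.linspace(0.9999, 0.05, 50)   # decreasing alpha-bar schedule (toy)
x0 = rng.standard_normal((4, 8, 8))      # stand-in for a VAE image latent
eps_hat = rng.standard_normal(x0.shape)  # frozen noise prediction (toy)

# DDIM inversion: walk the schedule toward higher noise.
z = np.sqrt(alphas[0]) * x0 + np.sqrt(1.0 - alphas[0]) * eps_hat
for a_from, a_to in zip(alphas[:-1], alphas[1:]):
    z = ddim_step(z, eps_hat, a_from, a_to)

# In the full method, the TUNet would map z to the target-view latent here,
# and the fused latent would seed DDIM sampling. With the same frozen eps,
# sampling back down the schedule recovers the starting latent exactly.
z_fused = fuse_latents(z, z)             # identity blend for this demo
x_rec = z_fused
for a_from, a_to in zip(alphas[::-1][:-1], alphas[::-1][1:]):
    x_rec = ddim_step(x_rec, eps_hat, a_from, a_to)
x0_rec = (x_rec - np.sqrt(1.0 - alphas[0]) * eps_hat) / np.sqrt(alphas[0])
```

In the actual method the two loops are separated by the learned TUNet translation and the correlation-aware fusion; the round trip here only illustrates why a well-predicted inverted latent, used as the initial condition for DDIM sampling, can reconstruct sharp detail.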