🤖 AI Summary
Existing monocular video re-rendering methods either suffer from spatially inconsistent deformations due to inadequate scene understanding or rely on explicit geometric reconstruction, making them sensitive to errors in depth estimation and camera calibration. This work proposes a novel approach that, for the first time, integrates implicit scene representations from a large-scale 4D reconstruction model into a diffusion-based generative framework. By jointly conditioning on these latent scene representations and source camera poses, the method enables high-fidelity novel-view synthesis without explicit geometry estimation. The approach significantly mitigates view-dependent drift and structural distortions during viewpoint transitions while preserving visual fidelity, achieving state-of-the-art performance on video re-rendering benchmarks.
📝 Abstract
Given a monocular video, the goal of video re-rendering is to generate views of the scene from a novel camera trajectory. Existing methods face two distinct challenges. Geometrically unconditioned models lack spatial awareness, leading to drift and deformation under viewpoint changes. Geometrically conditioned models, on the other hand, depend on estimated depth and explicit reconstruction, making them susceptible to depth inaccuracies and calibration errors. We propose to address these challenges by conditioning the video generation process on the implicit geometric knowledge embedded in the latent space of a large 4D reconstruction model. These latents capture scene structure in a continuous space without explicit reconstruction, providing a flexible representation that allows the pretrained diffusion prior to regularize errors more effectively. By jointly conditioning on these latents and source camera poses, we demonstrate that our model achieves state-of-the-art results on the video re-rendering task. The project webpage is https://lavr-4d-scene-rerender.github.io/
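The joint conditioning described in the abstract can be illustrated with a minimal sketch: per-frame latents from a 4D reconstruction model are fused with embedded source camera poses into a single conditioning signal for the diffusion denoiser. Everything below is an illustrative assumption (the shapes, the sinusoidal pose embedding, and the function names are hypothetical), not the paper's actual implementation.

```python
import numpy as np

def embed_poses(poses, n_freqs=8):
    """Sinusoidal embedding of flattened [R|t] camera extrinsics (hypothetical scheme)."""
    flat = poses.reshape(poses.shape[0], -1)            # (T, 12): one 3x4 matrix per frame
    freqs = 2.0 ** np.arange(n_freqs)                   # geometric frequency bands
    angles = flat[..., None] * freqs                    # (T, 12, n_freqs)
    emb = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return emb.reshape(poses.shape[0], -1)              # (T, 12 * 2 * n_freqs)

def build_conditioning(scene_latents, src_poses):
    """Joint conditioning: concatenate per-frame 4D-model latents with pose embeddings."""
    pose_emb = embed_poses(src_poses)
    return np.concatenate([scene_latents, pose_emb], axis=-1)

# Toy shapes: 8 frames, 64-dim scene latents, 3x4 source camera extrinsics.
latents = np.random.randn(8, 64)
poses = np.random.randn(8, 3, 4)
cond = build_conditioning(latents, poses)               # (8, 64 + 192) conditioning tokens
```

In a full model, `cond` would be injected into the video diffusion backbone (e.g. via cross-attention), so the generator sees implicit scene structure rather than an explicit depth map or mesh.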