AI Summary
This work addresses the challenges of geometric ambiguity and temporal inconsistency in monocular video when performing large-angle camera redirection, which often arise from insufficient observational cues. To overcome these limitations, the authors propose a training-free framework that explicitly decouples foreground and background reconstruction through a geometrically complete 4D proxy representation. The method integrates point cloud back-projection, an object-centric multi-view diffusion model, pixel-level 3D–3D correspondence alignment, and conditional video diffusion generation, all guided by the 4D structure to enable high-quality, temporally coherent video redirection from arbitrary viewpoints. Experiments demonstrate that the approach significantly improves geometric consistency and visual fidelity under large camera trajectories and complex scenes, while also supporting downstream applications such as edit propagation.
Abstract
Camera redirection aims to replay a dynamic scene from a single monocular video under a user-specified camera trajectory. However, large-angle redirection is inherently ill-posed: a monocular video captures only a narrow spatio-temporal view of a dynamic 3D scene, providing highly partial observations of the underlying 4D world. The key challenge is therefore to recover a complete and coherent representation from this limited input, with consistent geometry and motion. While recent diffusion-based methods achieve impressive results, they often break down under large-angle viewpoint changes far from the original trajectory, where missing visual grounding leads to severe geometric ambiguity and temporal inconsistency. To address this, we present FreeOrbit4D, an effective training-free framework that tackles this geometric ambiguity by recovering a geometry-complete 4D proxy as structural grounding for video generation. We obtain this proxy by decoupling foreground and background reconstructions: we unproject the monocular video into a static background and geometry-incomplete foreground point clouds in a unified global space, then leverage an object-centric multi-view diffusion model to synthesize multi-view images and reconstruct geometry-complete foreground point clouds in canonical object space. By aligning the canonical foreground point cloud to the global scene space via dense pixel-synchronized 3D–3D correspondences and projecting the geometry-complete 4D proxy onto target camera viewpoints, we provide geometric scaffolds that guide a conditional video diffusion model. Extensive experiments show that FreeOrbit4D produces more faithful redirected videos under challenging large-angle trajectories, and our geometry-complete 4D proxy further opens a potential avenue for practical applications such as edit propagation and 4D data generation. Project page and code will be released soon.