🤖 AI Summary
This work addresses the problem of generating unbounded dynamic 3D scenes from a single-view video, enabling spatially extensive camera fly-throughs with spatiotemporally consistent 3D motion and semantically controllable scene extension. To this end, we propose DynamicVoyager, a framework that reformulates dynamic scene generation as iterative scene outpainting: it first lifts the single-view video into a dynamic point cloud using estimated video depths, then treats pixels as rays so that outpainting at novel views is conditioned on ray context rather than 2D pixels alone. By combining depth estimation, dynamic point-cloud reconstruction, and ray-contextual outpainting in an iterative render-and-outpaint loop, it injects 3D motion information into the 2D outpainting process. Unlike prior methods constrained by static geometry or limited temporal coherence, DynamicVoyager overcomes the boundary limitations of single-view dynamic scene generation, supporting long-range camera trajectories and cross-frame motion consistency. Experiments demonstrate that our approach generates geometrically plausible, temporally coherent, and semantically editable unbounded dynamic 3D scenes, improving over baselines in motion consistency, visual realism, and controllability.
📝 Abstract
This paper studies the problem of generating an unbounded dynamic scene from a single view, which has wide applications in augmented/virtual reality and robotics. Since the scene changes over time, different generated views need to be consistent with the underlying 3D motions. While previous works learn such consistency by training from multiple views, the generated scene regions are constrained to lie close to the training views, permitting only limited camera movement. To address this issue, we propose DynamicVoyager, which reformulates dynamic scene generation as a scene outpainting process for new dynamic content. As 2D outpainting models can hardly generate 3D-consistent motions from only the 2D pixels of a single view, we treat pixels as rays to enrich the pixel input with ray context, so that 3D motion consistency can be learned from the ray information. More specifically, we first map the single-view video input to a dynamic point cloud using estimated video depths. We then render a partial video at a novel view and outpaint it with ray contexts from the point cloud to generate 3D-consistent motions. The outpainted video is used to update the point cloud, which in turn supports scene outpainting from future novel views. Experiments show that our model is able to generate unbounded scenes with consistent motions along fly-through cameras, and the generated content can be controlled with scene prompts.
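The iterative pipeline described above (lift the input video to a dynamic point cloud, render a partial video at each novel view, outpaint it with ray context, and fold the result back into the point cloud) can be sketched as a control-flow skeleton. This is a minimal, hypothetical illustration, not the paper's implementation: all function names (`estimate_video_depth`, `outpaint_with_ray_context`, etc.) are placeholders I introduce here, and their bodies are toy stand-ins for the real depth-estimation, rendering, and outpainting models.

```python
# Sketch of DynamicVoyager's outer loop, assuming placeholder models.
# Every function body below is a toy stand-in so the loop runs end to end.
from dataclasses import dataclass, field


@dataclass
class DynamicPointCloud:
    # Accumulates (id, value, depth, time) tuples across outpainting steps.
    points: list = field(default_factory=list)

    def add(self, new_points):
        self.points.extend(new_points)


def estimate_video_depth(video):
    # Placeholder: a real system would run a video depth estimator here.
    return [[1.0 for _ in frame] for frame in video]


def lift_to_point_cloud(video, depths):
    # Back-project each pixel to a (spacetime) point using its depth.
    pts = []
    for t, (frame, depth) in enumerate(zip(video, depths)):
        for i, (pix, d) in enumerate(zip(frame, depth)):
            pts.append((i, pix, d, t))
    return pts


def render_partial_video(cloud, camera):
    # Placeholder renderer: returns the points visible from `camera`;
    # the regions it cannot cover are what outpainting must fill in.
    return [p for p in cloud.points if p[0] % 2 == camera % 2]


def outpaint_with_ray_context(partial, cloud, camera):
    # Placeholder for the ray-contextual outpainting model: each missing
    # pixel is treated as a ray conditioned on the existing point cloud.
    # Here we just fabricate two new points per view.
    return [(len(cloud.points) + k, 0.5, 1.0, 0) for k in range(2)]


def generate_unbounded_scene(video, camera_trajectory):
    depths = estimate_video_depth(video)
    cloud = DynamicPointCloud()
    cloud.add(lift_to_point_cloud(video, depths))
    for camera in camera_trajectory:
        partial = render_partial_video(cloud, camera)
        new_points = outpaint_with_ray_context(partial, cloud, camera)
        cloud.add(new_points)  # outpainted content feeds future views
    return cloud
```

The key structural point the sketch captures is the feedback loop: each outpainted view is merged back into the shared dynamic point cloud, so later novel views are conditioned on everything generated so far, which is what lets the scene grow without a fixed boundary.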