AI Summary
Existing generative models struggle to jointly perform temporal extrapolation and novel view synthesis (NVS) for dynamic 4D driving scenes without per-scene optimization, largely because geometric and temporal consistency are difficult to model together. This paper proposes a unified generative framework that addresses this challenge. We introduce Stereo Forcing, a conditioning strategy that leverages geometric uncertainty to guide diffusion-based denoising, explicitly enforcing geometric consistency across views and frames. For efficient 4D reconstruction, we integrate a pre-trained video VAE with a range-view adapter, and we design a geometry-guided video diffusion model to synthesize future multi-view sequences. Our method achieves state-of-the-art performance on appearance and geometry reconstruction, temporal generation, and NVS. Crucially, it generalizes well to downstream perception and motion prediction tasks, validating its robustness and practical utility.
Abstract
Current generative models struggle to synthesize dynamic 4D driving scenes that simultaneously support temporal extrapolation and spatial novel view synthesis (NVS) without per-scene optimization; bridging generation and NVS remains a major challenge. We present PhiGenesis, a unified framework for 4D scene generation that extends video generation techniques with geometric and temporal consistency. Given multi-view image sequences and camera parameters, PhiGenesis produces temporally continuous 4D Gaussian splatting representations along target 3D trajectories. In the first stage, PhiGenesis leverages a pre-trained video VAE with a novel range-view adapter to enable feed-forward 4D reconstruction from multi-view images. This architecture supports single-frame or video inputs and outputs complete 4D scenes including geometry, semantics, and motion. In the second stage, PhiGenesis introduces a geometry-guided video diffusion model that uses rendered historical 4D scenes as priors to generate future views conditioned on target trajectories. To address geometric exposure bias in novel views, we propose Stereo Forcing, a novel conditioning strategy that integrates geometric uncertainty during denoising, enhancing temporal coherence by dynamically adjusting the generative influence based on uncertainty-aware perturbations. Experiments demonstrate that our method achieves state-of-the-art performance on appearance and geometry reconstruction, temporal generation, and NVS, while delivering competitive results in downstream evaluations. Project page: https://jiangxb98.github.io/PhiGensis
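The abstract describes Stereo Forcing only at a high level: a rendered geometric prior conditions the denoiser, with its influence down-weighted (and perturbed) where geometric uncertainty is high. A minimal sketch of that idea is below; every name here (`stereo_forcing_step`, the `denoise_fn` interface, the linear blending rule) is an illustrative assumption, not the authors' actual formulation.

```python
import numpy as np

def stereo_forcing_step(x_t, rendered_prior, uncertainty, t, denoise_fn):
    """One denoising step with uncertainty-aware geometric conditioning (sketch).

    x_t:            current noisy latent, shape (C, H, W)
    rendered_prior: rendering of the historical 4D scene at the target view
    uncertainty:    per-pixel geometric uncertainty in [0, 1] (1 = unreliable)
    denoise_fn:     denoiser callable (x_t, condition, t) -> x_{t-1}; hypothetical
    """
    # Trust the geometric prior only where its uncertainty is low.
    weight = 1.0 - uncertainty
    # Replace unreliable regions of the prior with noise so the generative
    # model dominates there (uncertainty-aware perturbation).
    perturbed = weight * rendered_prior + uncertainty * np.random.randn(*rendered_prior.shape)
    return denoise_fn(x_t, perturbed, t)

# Toy usage with random tensors and a stand-in denoiser.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8))
prior = rng.standard_normal((4, 8, 8))
unc = rng.random((4, 8, 8))
out = stereo_forcing_step(x, prior, unc, t=10,
                          denoise_fn=lambda x_t, cond, t: x_t - 0.1 * cond)
```

The blending shown is the simplest plausible reading of "dynamically adjusting generative influence"; the paper's actual mechanism may operate in latent space or modulate attention rather than pixels.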