PanoWorld: Geometry-Consistent Panoramic Video World Modeling

📅 2026-05-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing panoramic video generation methods often produce inconsistent depth and distorted motion because they do not explicitly model 3D scene geometry or dynamics. This work reframes panoramic video generation as a geometry- and dynamics-consistent modeling task: it builds on a pretrained perspective-video world model and adds two regularization terms that enforce depth and trajectory consistency. To better capture spherical geometry, the approach adapts the conditioning and the positional encoding to the sphere. The authors also construct PanoGeo, a unified geometry-aware panoramic video dataset with depth, trajectory, and prompt annotations, used for both training and evaluation. Experiments show that the method improves geometric consistency over prior panoramic generation methods while keeping visual realism competitive, which the authors argue is needed for the holistic spatial understanding that embodied AI applications require.
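
To make the spherical-geometry point concrete, here is a minimal sketch of the standard equirectangular mapping from pixel coordinates to unit viewing directions on the sphere. It is not taken from the paper: the summary does not say which panorama format or encoding PanoWorld uses, so the equirectangular layout, the axis convention (y up, z forward), and the function name `equirect_directions` are all assumptions made for illustration.

```python
import numpy as np

def equirect_directions(height, width):
    """Unit viewing direction for each pixel center of an equirectangular image.

    Assumed convention (not from the paper): y points up, z points forward,
    longitude increases to the right, latitude increases upward.
    """
    # Normalized pixel-center coordinates in (0, 1)
    u = (np.arange(width) + 0.5) / width
    v = (np.arange(height) + 0.5) / height
    # Longitude spans (-pi, pi), latitude spans (-pi/2, pi/2)
    lon = (u - 0.5) * 2.0 * np.pi
    lat = (0.5 - v) * np.pi
    lon, lat = np.meshgrid(lon, lat)  # both have shape (height, width)
    x = np.cos(lat) * np.sin(lon)
    y = np.sin(lat)
    z = np.cos(lat) * np.cos(lon)
    return np.stack([x, y, z], axis=-1)  # shape (height, width, 3)

dirs = equirect_directions(4, 8)
```

Directions like these are one possible input to a geometry-aware positional encoding, since pixels at the left and right image borders map to neighboring points on the sphere even though they are far apart in the 2D grid.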

📝 Abstract

We present PanoWorld, a panoramic video world model that generates geometry-consistent 360° video from a single image and a caption. Existing panoramic video methods optimize primarily for visual realism and do not explicitly constrain the underlying 3D scene state, producing outputs that appear plausible yet exhibit inconsistent depth, broken correspondences, and implausible motion across the spherical surface. We address this gap by framing panoramic video generation as a geometry- and dynamics-consistent latent state modeling problem rather than pure visual synthesis. Building on a pre-trained perspective video world model, we introduce two lightweight regularizers: a depth consistency loss against pseudo ground-truth panoramic depth, and a trajectory consistency loss that supervises the 3D world-frame positions of tracked points across time. We further apply spherical-geometry-aware adaptation to the conditioning and positional encoding. We additionally introduce PanoGeo, a unified geometry-aware panoramic video dataset with consistent depth, trajectory, and prompt annotations across diverse real and synthetic sources, used for both training and stratified evaluation. Experiments show that PanoWorld improves geometric consistency over prior panoramic generation methods while maintaining competitive visual realism, establishing that panoramic video generation must be treated as a geometric modeling problem to support the holistic spatial understanding requirements of embodied AI applications. Code is available at https://github.com/ostadabbas/PanoWorld.
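
The two regularizers named in the abstract could look roughly like the sketch below. The abstract does not give their exact form, so everything here is an assumption: an L1 penalty for depth, a mean Euclidean distance for trajectories, the optional masks, and the hypothetical function names `depth_consistency_loss` and `trajectory_consistency_loss`. The released code may define them differently.

```python
import numpy as np

def depth_consistency_loss(pred_depth, pseudo_gt_depth, valid=None):
    """Mean absolute depth error over valid pixels (assumed L1 form).

    pred_depth, pseudo_gt_depth: arrays of shape (T, H, W).
    valid: optional boolean mask of the same shape.
    """
    err = np.abs(pred_depth - pseudo_gt_depth)
    if valid is None:
        valid = np.ones(err.shape, dtype=bool)
    return float(err[valid].mean()) if valid.any() else 0.0

def trajectory_consistency_loss(pred_xyz, ref_xyz, visible=None):
    """Mean Euclidean distance between predicted and reference 3D
    world-frame positions of tracked points (assumed form).

    pred_xyz, ref_xyz: arrays of shape (T, N, 3) for T frames, N points.
    visible: optional boolean mask of shape (T, N).
    """
    dist = np.linalg.norm(pred_xyz - ref_xyz, axis=-1)  # shape (T, N)
    if visible is None:
        visible = np.ones(dist.shape, dtype=bool)
    return float(dist[visible].mean()) if visible.any() else 0.0

# Toy inputs: a constant depth offset of 0.5 and a constant 3D offset of length 5
depth_loss = depth_consistency_loss(np.ones((2, 4, 8)), np.full((2, 4, 8), 1.5))
traj_loss = trajectory_consistency_loss(
    np.zeros((3, 5, 3)), np.tile(np.array([3.0, 4.0, 0.0]), (3, 5, 1))
)
```

In a training setup these terms would presumably be added to the base generation objective with small weights, which matches the abstract's description of them as lightweight regularizers; the weighting scheme itself is not stated in the abstract.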

Problem

Research questions and friction points this paper is trying to address.

panoramic video generation

geometry consistency

3D scene modeling

spherical surface

embodied AI

Innovation

Methods, ideas, or system contributions that make the work stand out.

geometry-consistent generation

panoramic video modeling

depth consistency