PanoWorld-X: Generating Explorable Panoramic Worlds via Sphere-Aware Video Diffusion

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing methods struggle to generate high-fidelity, freely explorable 360° panoramic videos: narrow field-of-view constraints cause scene discontinuities, while coarse-grained camera control limits exploration freedom. To address this, we propose PanoWorld-X, a framework built on a Sphere-Aware Diffusion Transformer that reprojects equirectangular features onto the spherical surface to explicitly model geometric adjacency and spatiotemporal consistency in latent space. We further construct a large-scale panoramic video–trajectory paired dataset using Unreal Engine and integrate spherical reprojection with video diffusion modeling for controllable generation. Experiments demonstrate that PanoWorld-X significantly outperforms state-of-the-art methods in motion range, trajectory control accuracy, and visual quality. The approach shows strong potential for immersive VR applications and embodied agent navigation, enabling precise, geometry-aware, and temporally coherent panoramic video synthesis.

📝 Abstract
Generating a complete and explorable 360-degree visual world enables a wide range of downstream applications. While prior works have advanced the field, they remain constrained by either narrow field-of-view limitations, which hinder the synthesis of continuous and holistic scenes, or insufficient camera controllability that restricts free exploration by users or autonomous agents. To address this, we propose PanoWorld-X, a novel framework for high-fidelity and controllable panoramic video generation with diverse camera trajectories. Specifically, we first construct a large-scale dataset of panoramic video-exploration route pairs by simulating camera trajectories in virtual 3D environments via Unreal Engine. As the spherical geometry of panoramic data misaligns with the inductive priors from conventional video diffusion, we then introduce a Sphere-Aware Diffusion Transformer architecture that reprojects equirectangular features onto the spherical surface to model geometric adjacency in latent space, significantly enhancing visual fidelity and spatiotemporal continuity. Extensive experiments demonstrate that our PanoWorld-X achieves superior performance in various aspects, including motion range, control precision, and visual quality, underscoring its potential for real-world applications.
Problem

Research questions and friction points this paper is trying to address.

Generating explorable 360-degree panoramic visual worlds
Overcoming narrow field-of-view and camera controllability limitations
Enhancing visual fidelity and spatiotemporal continuity in panoramas
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates panoramic videos via sphere-aware diffusion transformer
Uses Unreal Engine to simulate camera trajectory datasets
Reprojects equirectangular features onto spherical surfaces
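The spherical reprojection described above rests on the standard equirectangular-to-sphere coordinate mapping. A minimal NumPy sketch of that mapping follows; this is illustrative only (the paper's latent-space module is not public), and the function name and grid conventions here are assumptions:

```python
import numpy as np

def equirect_to_sphere(h, w):
    """Map an h x w equirectangular pixel grid to unit-sphere XYZ coordinates.

    Illustrative sketch of the standard lon/lat -> Cartesian mapping that a
    sphere-aware reprojection builds on; not the paper's actual implementation.
    """
    # Pixel centers: longitude spans [-pi, pi), latitude spans (-pi/2, pi/2)
    lon = (np.arange(w) + 0.5) / w * 2.0 * np.pi - np.pi
    lat = np.pi / 2.0 - (np.arange(h) + 0.5) / h * np.pi
    lon, lat = np.meshgrid(lon, lat)  # both shaped (h, w)

    # Spherical -> Cartesian on the unit sphere
    x = np.cos(lat) * np.cos(lon)
    y = np.cos(lat) * np.sin(lon)
    z = np.sin(lat)
    return np.stack([x, y, z], axis=-1)  # shape (h, w, 3), unit-norm vectors
```

Neighborhoods computed on these 3D points respect true spherical adjacency (e.g. the left and right image borders meet, and rows near the poles converge), which is the geometric prior an equirectangular feature grid alone does not encode.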
Yuyang Yin
Beijing Jiaotong University
Computer Vision, AIGC

HaoXiang Guo
Skywork AI

Fangfu Liu
Tsinghua University
Computer Vision, 3D Vision, Machine Learning

Mengyu Wang
Beijing Jiaotong University

Hanwen Liang
University of Toronto

Eric Li
Skywork AI

Yikai Wang
Beijing Normal University

Xiaojie Jin
Beijing Jiaotong University

Yao Zhao
Beijing Jiaotong University

Yunchao Wei
Professor, Beijing Jiaotong University, UTS, UIUC, NUS
Computer Vision, Machine Learning