🤖 AI Summary
Existing single-image or text-driven 3D world generation methods suffer from limited scene coverage and lack free navigability. To address this, we propose the first unified framework for omnidirectional, explorable 3D world generation, integrating panoramic video diffusion with geometry-aware 3D reconstruction. Our method introduces a trajectory-guided panoramic video diffusion model, a feed-forward wide-field-of-view reconstruction network, and an optimization-driven end-to-end 3D reconstruction pipeline. To enable training and evaluation, we introduce Matrix-Pano, the first large-scale synthetic dataset featuring dense depth maps and multi-view camera trajectories. Experiments demonstrate state-of-the-art performance on both panoramic video generation and 3D world reconstruction, significantly improving the spatial scale, geometric consistency, and interactivity of generated scenes, thereby enabling immersive, large-scale spatial exploration.
📄 Abstract
Explorable 3D world generation from a single image or text prompt is a cornerstone of spatial intelligence. Recent works utilize video models to achieve wide-scope and generalizable 3D world generation; however, existing approaches still suffer from a limited scope in the generated scenes. In this work, we propose Matrix-3D, a framework that utilizes a panoramic representation for wide-coverage, omnidirectional, explorable 3D world generation, combining conditional video generation with panoramic 3D reconstruction. We first train a trajectory-guided panoramic video diffusion model that employs scene mesh renders as conditioning to enable high-quality and geometrically consistent scene video generation. To lift the panoramic scene video to a 3D world, we propose two separate methods: (1) a feed-forward large panorama reconstruction model for rapid 3D scene reconstruction and (2) an optimization-based pipeline for accurate and detailed 3D scene reconstruction. To facilitate effective training, we also introduce the Matrix-Pano dataset, the first large-scale synthetic collection comprising 116K high-quality static panoramic video sequences with depth and trajectory annotations. Extensive experiments demonstrate that our proposed framework achieves state-of-the-art performance in both panoramic video generation and 3D world generation. See more at https://matrix-3d.github.io.
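The pipeline described above can be summarized as a two-stage process: generate a trajectory-guided panoramic video conditioned on scene mesh renders, then lift it to a 3D world via either the feed-forward or the optimization-based reconstruction path. The following is a minimal structural sketch of that flow; every function and class name here is hypothetical (stub stand-ins, not the authors' actual API), shown only to make the stage boundaries concrete.

```python
# Hypothetical sketch of the two-stage Matrix-3D pipeline; all names below
# are illustrative stubs, not the released code's API.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PanoFrame:
    """One panoramic (equirectangular) video frame plus its camera pose."""
    rgb: object          # stand-in for an H x W x 3 panorama image
    pose: Tuple          # camera pose along the user-given trajectory


def render_scene_mesh(image, trajectory) -> List[Tuple]:
    """Assumed conditioning step: render a coarse scene mesh along the
    trajectory to condition the video diffusion model (stub)."""
    return [(image, pose) for pose in trajectory]


def panoramic_video_diffusion(conditions) -> List[PanoFrame]:
    """Stage 1: trajectory-guided panoramic video diffusion, conditioned
    on the mesh renders (stub returning placeholder frames)."""
    return [PanoFrame(rgb=img, pose=pose) for img, pose in conditions]


def reconstruct_3d(frames: List[PanoFrame], mode: str = "feed_forward") -> dict:
    """Stage 2: lift the panoramic video to a 3D world, via either the
    rapid feed-forward model or the slower optimization-based pipeline."""
    assert mode in ("feed_forward", "optimization")
    return {"num_views": len(frames), "mode": mode}


def generate_world(image, trajectory, mode: str = "feed_forward") -> dict:
    """End-to-end: single image + trajectory -> explorable 3D world."""
    conditions = render_scene_mesh(image, trajectory)
    frames = panoramic_video_diffusion(conditions)
    return reconstruct_3d(frames, mode=mode)


world = generate_world(image="input.png", trajectory=[(0, 0, 0), (1, 0, 0)])
print(world)  # {'num_views': 2, 'mode': 'feed_forward'}
```

The key design point the sketch captures is that the two reconstruction paths are interchangeable consumers of the same panoramic video, trading speed (feed-forward) against fidelity (optimization).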