🤖 AI Summary
This work proposes OneWorld, a framework that performs diffusion-based 3D scene generation directly in a 3D latent space. Existing diffusion-based methods operate primarily in 2D image/video latent spaces and therefore struggle to keep geometry and appearance consistent across views. Central to OneWorld is the 3D Unified Representation Autoencoder (3D-URAE), which builds on pretrained 3D foundation models and combines geometry, appearance, and semantics in a unified 3D latent space. To improve multi-view consistency, the framework adds a token-level Cross-View-Correspondence (CVC) consistency loss; to mitigate train-inference exposure bias, it uses Manifold-Drift Forcing (MDF), a training strategy that mixes drifted and original representations. The reported experiments show that OneWorld generates high-quality 3D scenes with better cross-view consistency than state-of-the-art 2D-based methods.
📝 Abstract
Existing diffusion-based 3D scene generation methods primarily operate in 2D image/video latent spaces, which makes maintaining cross-view appearance and geometric consistency inherently challenging. To bridge this gap, we present OneWorld, a framework that performs diffusion directly within a coherent 3D representation space. Central to our approach is the 3D Unified Representation Autoencoder (3D-URAE), which leverages pretrained 3D foundation models and augments their geometry-centric nature by injecting appearance and distilling semantics into a unified 3D latent space. Furthermore, we introduce a token-level Cross-View-Correspondence (CVC) consistency loss to explicitly enforce structural alignment across views, and propose Manifold-Drift Forcing (MDF) to mitigate train-inference exposure bias and shape a robust 3D manifold by mixing drifted and original representations. Comprehensive experiments demonstrate that OneWorld generates high-quality 3D scenes with superior cross-view consistency compared to state-of-the-art 2D-based methods. Our code will be available at https://github.com/SensenGao/OneWorld.
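
The abstract does not give formulas for the CVC loss or for the MDF mixing step, so the following is only an illustrative sketch of what such operations could look like, not the authors' implementation. It assumes a cosine-distance penalty between matched tokens of two views and a random per-token swap between original and drifted latents. The function names `cvc_loss` and `mix_drifted`, the `matches` index array, and the `ratio` parameter are all hypothetical.

```python
import numpy as np


def cvc_loss(tokens_a, tokens_b, matches):
    """Hypothetical token-level cross-view consistency loss.

    tokens_a, tokens_b: (N, D) arrays of latent tokens for two views.
    matches: (M, 2) integer array; row (i, j) pairs token i of view A
             with token j of view B (assumed to see the same 3D point).
    Returns the mean cosine distance over matched pairs (0 = aligned).
    """
    a = tokens_a[matches[:, 0]]
    b = tokens_b[matches[:, 1]]
    # Normalize each token so the dot product is a cosine similarity.
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-8)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-8)
    return float(np.mean(1.0 - np.sum(a * b, axis=1)))


def mix_drifted(original, drifted, ratio, rng):
    """Hypothetical mixing of drifted and original representations.

    Each token is replaced by its drifted version with probability
    `ratio`; the remaining tokens keep their original values.
    Returns the mixed tokens and the boolean replacement mask.
    """
    mask = rng.random(original.shape[0]) < ratio
    mixed = np.where(mask[:, None], drifted, original)
    return mixed, mask
```

In this sketch, identical matched tokens give a loss near zero and opposite tokens give a loss near two; `ratio` controls how often the model would see drifted inputs, which is one simple way to expose training to inference-like inputs.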