🤖 AI Summary
Existing text-driven 3D scene generation methods struggle to produce semantically rich, photorealistic scenes because aligned 3D-text data is scarce and multi-view stitching is inconsistent. To address these limitations, this work proposes PSGS, a two-stage framework. The first stage generates semantically coherent panoramic images through a layout reasoning layer and a self-optimization layer; the second initializes a globally consistent 3D Gaussian Splatting point cloud via a panorama sliding mechanism. The approach combines structured spatial relationship parsing, iterative feedback from multimodal large language models (MLLMs), and sampling of overlapping perspectives from the panorama, with depth and semantic coherence losses applied during training to improve rendering quality and detail fidelity. Experiments show that PSGS outperforms existing methods in panorama generation and produces more appealing 3D scenes, offering a scalable route to immersive content creation for applications such as VR and AR.
📝 Abstract
Generating realistic 3D scenes from text is crucial for immersive applications like VR, AR, and gaming. While text-driven approaches promise efficiency, existing methods suffer from limited 3D-text data and inconsistent multi-view stitching, resulting in overly simplistic scenes. To address this, we propose PSGS, a two-stage framework for high-fidelity panoramic scene generation. First, a novel two-layer optimization architecture generates semantically coherent panoramas: a layout reasoning layer parses text into structured spatial relationships, while a self-optimization layer refines visual details via iterative MLLM feedback. Second, our panorama sliding mechanism initializes globally consistent 3D Gaussian Splatting point clouds by strategically sampling overlapping perspectives. By incorporating depth and semantic coherence losses during training, we greatly improve the quality and detail fidelity of rendered scenes. Our experiments demonstrate that PSGS outperforms existing methods in panorama generation and produces more appealing 3D scenes, offering a robust solution for scalable immersive content creation.
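To make the idea of "sampling overlapping perspectives" concrete, the following is a minimal sketch of how a sliding window over a 360-degree panorama might pick view directions. It is not the PSGS implementation: the abstract does not specify the field of view, the overlap ratio, or the sampling schedule, so the function names `sliding_view_yaws` and `view_column_span` and all parameter values here are assumptions chosen for illustration only.

```python
import math


def sliding_view_yaws(fov_deg: float, overlap: float) -> list[float]:
    """Return evenly spaced yaw angles (degrees) that cover a full 360 turn.

    fov_deg: horizontal field of view of each perspective view (assumed value).
    overlap: minimum fraction of each view shared with its neighbour, in [0, 1).
    """
    if not 0.0 < fov_deg < 180.0:
        raise ValueError("fov_deg must be in (0, 180)")
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    # Largest step between neighbouring views that still keeps the overlap.
    max_step = fov_deg * (1.0 - overlap)
    # Round the view count up so the steps divide the circle evenly.
    n_views = math.ceil(360.0 / max_step)
    step = 360.0 / n_views
    return [i * step for i in range(n_views)]


def view_column_span(yaw_deg: float, fov_deg: float, pano_width: int) -> tuple[int, int]:
    """Columns (start, end) an equirectangular panorama devotes to one view.

    Longitude maps linearly to columns, so the span is computed along the
    horizon row. start > end means the span wraps around the image seam.
    """
    center = yaw_deg / 360.0 * pano_width
    half = fov_deg / 360.0 * pano_width / 2.0
    start = int(round(center - half)) % pano_width
    end = int(round(center + half)) % pano_width
    return start, end


# Hypothetical settings: 90-degree views with 50% overlap give 8 views, 45 degrees apart.
yaws = sliding_view_yaws(fov_deg=90.0, overlap=0.5)
spans = [view_column_span(y, 90.0, 4096) for y in yaws]
```

Because neighbouring views share part of their content, points reconstructed from adjacent views can be cross-checked against each other, which is one plausible reason overlapping sampling would help keep the initial point cloud globally consistent.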