🤖 AI Summary
Traditional 3D scene synthesis relies heavily on manual design, and automation faces two key bottlenecks: (1) the weak 3D spatial reasoning of large language models (LLMs), which leads to geometric inconsistencies; and (2) limited viewpoint selection and cross-view inconsistency in image-based generation methods. This paper proposes VIPScene, a framework that exploits the commonsense knowledge of the 3D physical world encoded in video generation models to ensure coherent scene layouts and consistent object placements across views. VIPScene accepts both text and image prompts and integrates video generation, feedforward 3D reconstruction, and open-vocabulary perception models to analyze each object in a scene semantically and geometrically. The paper also introduces First-Person View Score (FPVScore), an evaluation metric that renders continuous first-person views and leverages the reasoning ability of multimodal LLMs to assess scene coherence and plausibility. Extensive experiments show significant improvements over existing methods across diverse scenarios, with strong generalization, high realism, and physically plausible layouts.
📝 Abstract
Traditionally, 3D scene synthesis requires expert knowledge and significant manual effort. Automating this process could greatly benefit fields such as architectural design, robotics simulation, virtual reality, and gaming. Recent approaches to 3D scene synthesis often rely on the commonsense reasoning of large language models (LLMs) or the strong visual priors of modern image generation models. However, current LLMs demonstrate limited 3D spatial reasoning, which restricts their ability to generate realistic and coherent 3D scenes. Meanwhile, image generation-based methods often suffer from constraints in viewpoint selection and multi-view inconsistencies. In this work, we present Video Perception models for 3D Scene synthesis (VIPScene), a novel framework that exploits the commonsense knowledge of the 3D physical world encoded in video generation models to ensure coherent scene layouts and consistent object placements across views. VIPScene accepts both text and image prompts and seamlessly integrates video generation, feedforward 3D reconstruction, and open-vocabulary perception models to analyze each object in a scene semantically and geometrically. This enables flexible scene synthesis with high realism and structural consistency. For more precise analysis, we further introduce First-Person View Score (FPVScore) for coherence and plausibility evaluation, which utilizes continuous first-person perspectives to capitalize on the reasoning ability of multimodal large language models. Extensive experiments show that VIPScene significantly outperforms existing methods and generalizes well across diverse scenarios. The code will be released.
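The abstract describes FPVScore only at a high level: render a continuous first-person camera sweep through the synthesized scene, have a multimodal LLM rate each frame, and aggregate the ratings into a single score. The sketch below illustrates that aggregation step only; the paper does not specify the exact rubric or aggregation, so the `mllm_rate_frame` stub and the mean-based aggregation are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of FPVScore-style aggregation (not the paper's code).
# Assumption: a real pipeline would pass each rendered first-person frame to
# a multimodal LLM with a coherence/plausibility rubric and get back a score.

from statistics import mean

def mllm_rate_frame(frame: dict) -> float:
    """Stub for the multimodal-LLM rating call.

    In practice this would send the rendered frame (and a rubric prompt)
    to a multimodal LLM and parse a plausibility score in [0, 1].
    Here it just returns a pre-filled mock score.
    """
    return frame["mock_score"]

def fpv_score(frames: list[dict]) -> float:
    """Aggregate per-frame ratings over a continuous first-person sweep.

    A simple mean is used here as an illustrative aggregation choice.
    """
    if not frames:
        raise ValueError("need at least one rendered frame")
    return mean(mllm_rate_frame(f) for f in frames)

# Example: three frames sampled along a camera walk through the scene.
frames = [{"mock_score": s} for s in (0.8, 0.9, 0.7)]
print(round(fpv_score(frames), 2))
```

Using a continuous trajectory (rather than a few isolated viewpoints) is what lets the evaluator catch cross-view inconsistencies, since adjacent frames should depict the same objects in compatible positions.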