🤖 AI Summary
To address the scarcity of high-quality synthetic data and the high cost of manual modeling in 3D indoor scene understanding, this paper proposes an end-to-end, customizable 3D scene synthesis framework. Methodologically, it introduces a unified generation paradigm that integrates text-to-image diffusion, multi-view diffusion, and NeRF-based meshing; designs a joint geometry-appearance loss; and employs a progressive training strategy to generate high-fidelity 3D object assets from text descriptions and automatically assemble them into target floor plans. Contributions include: (1) significantly improved geometric accuracy, texture realism, and scene diversity of the synthetic data; (2) a substantial reduction in reliance on manual 3D modeling; and (3) empirically validated gains in model generalization and robustness on downstream tasks, including depth estimation and object tracking, demonstrating the framework's effectiveness for training vision models.
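The summary does not give the exact form of the joint geometry-appearance loss. Below is a minimal sketch of what such a combined objective often looks like in practice; the term choices (depth, normals, photometric error) and weights are illustrative assumptions, not the paper's definitions.

```python
import torch
import torch.nn.functional as F

def joint_geometry_appearance_loss(pred_depth, gt_depth,
                                    pred_normals, gt_normals,
                                    pred_rgb, gt_rgb,
                                    w_geom=1.0, w_app=1.0):
    """Illustrative joint objective: a geometry term (depth + normal alignment)
    plus an appearance term (photometric RGB error). The paper's actual loss
    may differ in both its terms and its weighting."""
    # Geometry: L1 depth error plus a normal-alignment penalty.
    depth_loss = F.l1_loss(pred_depth, gt_depth)
    normal_loss = (1.0 - F.cosine_similarity(pred_normals, gt_normals, dim=-1)).mean()
    geom_loss = depth_loss + normal_loss

    # Appearance: simple photometric (MSE) error on rendered colors.
    app_loss = F.mse_loss(pred_rgb, gt_rgb)

    return w_geom * geom_loss + w_app * app_loss
```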
📝 Abstract
Modern machine learning models for scene understanding, such as depth estimation and object tracking, rely on large, high-quality datasets that mimic real-world deployment scenarios. To address data scarcity, we propose an end-to-end system for generating scalable, high-quality, and customizable synthetic 3D indoor scenes. By integrating and adapting text-to-image and multi-view diffusion models with Neural Radiance Field-based meshing, the system generates high-fidelity 3D object assets from text prompts and incorporates them into pre-defined floor plans using a rendering tool. By introducing novel loss functions and training strategies into existing methods, the system supports on-demand scene generation, aiming to alleviate the scarcity of currently available data, which is typically crafted manually by artists. This system advances the role of synthetic data in addressing machine learning training limitations, enabling more robust and generalizable models for real-world applications.
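The abstract describes the pipeline only at a high level. The sketch below shows how the named stages could be chained; every function here is a hypothetical placeholder standing in for a component mentioned in the abstract (text-to-image diffusion, multi-view diffusion, NeRF meshing, floor-plan placement and rendering), not the paper's actual API.

```python
from dataclasses import dataclass
from typing import List

# All functions below are illustrative stubs for the stages named in the
# abstract; names, signatures, and return values are assumptions.

@dataclass
class Asset:
    prompt: str
    mesh_path: str  # textured mesh extracted from the NeRF

def text_to_image(prompt: str) -> str:
    """Placeholder for the text-to-image diffusion stage."""
    return f"{prompt}_reference.png"

def multi_view_diffusion(reference_image: str) -> List[str]:
    """Placeholder for the multi-view diffusion stage."""
    return [f"{reference_image}.view{i}.png" for i in range(8)]

def nerf_meshing(views: List[str]) -> str:
    """Placeholder for NeRF fitting and mesh extraction."""
    return "asset_mesh.obj"

def place_and_render(floor_plan: str, assets: List[Asset]) -> str:
    """Placeholder for assembling assets into a floor plan and rendering."""
    return f"{floor_plan}_rendered/"

def generate_scene(floor_plan: str, prompts: List[str]) -> str:
    """Chain the stages described in the abstract: text prompt -> 3D asset -> scene."""
    assets = []
    for prompt in prompts:
        ref = text_to_image(prompt)
        views = multi_view_diffusion(ref)
        mesh = nerf_meshing(views)
        assets.append(Asset(prompt=prompt, mesh_path=mesh))
    return place_and_render(floor_plan, assets)

# Example usage with hypothetical inputs:
# generate_scene("studio_apartment.json", ["a mid-century walnut armchair", "a floor lamp"])
```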