🤖 AI Summary
To address the scarcity of high-quality synthetic data and the high cost of manual modeling in 3D indoor scene understanding, this paper proposes an end-to-end, customizable 3D scene synthesis framework. Methodologically, it introduces a unified generation paradigm that integrates text-to-image diffusion, multi-view diffusion, and NeRF-based meshing; designs a joint geometry-appearance loss; and employs a progressive training strategy to generate high-fidelity 3D object assets from text descriptions and automatically assemble them into target floor plans. Contributions include: (1) significantly improved geometric accuracy, texture realism, and scene diversity of the synthetic data; (2) a substantial reduction in reliance on manual 3D modeling; and (3) empirically validated gains in model generalization and robustness on downstream tasks, including depth estimation and object tracking, demonstrating the framework's effectiveness for training vision models.
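The summary does not give the exact form of the joint geometry-appearance loss. Below is a minimal sketch of what such a combined objective often looks like in practice; the term choices (depth, normals, photometric error) and weights are illustrative assumptions, not the paper's definitions.

```python
import torch
import torch.nn.functional as F

def joint_geometry_appearance_loss(pred_depth, gt_depth,
                                    pred_normals, gt_normals,
                                    pred_rgb, gt_rgb,
                                    w_geom=1.0, w_app=1.0):
    """Illustrative joint objective: a geometry term (depth + normal alignment)
    plus an appearance term (photometric RGB error). The paper's actual loss
    may differ in both its terms and its weighting."""
    # Geometry: L1 depth error plus a normal-alignment penalty.
    depth_loss = F.l1_loss(pred_depth, gt_depth)
    normal_loss = (1.0 - F.cosine_similarity(pred_normals, gt_normals, dim=-1)).mean()
    geom_loss = depth_loss + normal_loss

    # Appearance: simple photometric (MSE) error on rendered colors.
    app_loss = F.mse_loss(pred_rgb, gt_rgb)

    return w_geom * geom_loss + w_app * app_loss
```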
📝 Abstract
Modern machine learning models for scene understanding, such as depth estimation and object tracking, rely on large, high-quality datasets that mimic real-world deployment scenarios. To address data scarcity, we propose an end-to-end system for generating scalable, high-quality, and customizable synthetic 3D indoor scenes. By integrating and adapting text-to-image and multi-view diffusion models with Neural Radiance Field-based meshing, the system generates high-fidelity 3D object assets from text prompts and incorporates them into pre-defined floor plans using a rendering tool. By introducing novel loss functions and training strategies into existing methods, the system supports on-demand scene generation, aiming to alleviate the scarcity of currently available data, which is typically crafted manually by artists. This system advances the role of synthetic data in addressing machine learning training limitations, enabling more robust and generalizable models for real-world applications.
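The abstract describes the pipeline only at a high level. The sketch below shows how the named stages could be chained; every function here is a hypothetical placeholder standing in for a component mentioned in the abstract (text-to-image diffusion, multi-view diffusion, NeRF meshing, floor-plan placement and rendering), not the paper's actual API.

```python
from dataclasses import dataclass
from typing import List

# All functions below are illustrative stubs for the stages named in the
# abstract; names, signatures, and return values are assumptions.

@dataclass
class Asset:
    prompt: str
    mesh_path: str  # textured mesh extracted from the NeRF

def text_to_image(prompt: str) -> str:
    """Placeholder for the text-to-image diffusion stage."""
    return f"{prompt}_reference.png"

def multi_view_diffusion(reference_image: str) -> List[str]:
    """Placeholder for the multi-view diffusion stage."""
    return [f"{reference_image}.view{i}.png" for i in range(8)]

def nerf_meshing(views: List[str]) -> str:
    """Placeholder for NeRF fitting and mesh extraction."""
    return "asset_mesh.obj"

def place_and_render(floor_plan: str, assets: List[Asset]) -> str:
    """Placeholder for assembling assets into a floor plan and rendering."""
    return f"{floor_plan}_rendered/"

def generate_scene(floor_plan: str, prompts: List[str]) -> str:
    """Chain the stages described in the abstract: text prompt -> 3D asset -> scene."""
    assets = []
    for prompt in prompts:
        ref = text_to_image(prompt)
        views = multi_view_diffusion(ref)
        mesh = nerf_meshing(views)
        assets.append(Asset(prompt=prompt, mesh_path=mesh))
    return place_and_render(floor_plan, assets)

# Example usage with hypothetical inputs:
# generate_scene("studio_apartment.json", ["a mid-century walnut armchair", "a floor lamp"])
```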