🤖 AI Summary
To address the scarcity and high manual construction cost of photorealistic, task-specific 3D scenes for robot simulation training, this paper proposes a physics-aware scene generation framework based on diffusion models. Methodologically, it (1) unifies object selection from a fixed asset library and SE(3) pose prediction within a single generative model; (2) introduces a novel MCTS-based, goal-directed inference-time search strategy for diffusion models; and (3) ensures geometric and physical feasibility through projection-based correction combined with physics-simulation validation. Contributions include: (i) a large open-source dataset of over 44 million SE(3) scenes; (ii) support for conditional generation, reinforcement-learning post-training, and scalable synthesis across five distinct environment categories; and (iii) significant improvements in clutter density, layout plausibility, and task alignment. All code, models, and datasets are publicly released.
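The MCTS-based inference-time search can be pictured as tree search over partial denoising trajectories: each node holds a noisy scene latent, branches correspond to alternative stochastic denoising steps, and completed rollouts are scored by a task-alignment reward. The sketch below illustrates this idea in Python; `denoise_step` and `reward` are hypothetical stand-ins, not the paper's released model or objective.

```python
import math
import numpy as np

# Hypothetical stand-ins for the paper's diffusion sampler and task reward.
def denoise_step(x, t, rng):
    """One stochastic reverse-diffusion step (toy: drift toward origin + noise)."""
    return x - 0.1 * x + 0.1 * rng.standard_normal(x.shape)

def reward(x):
    """Task-alignment score of a fully denoised scene (toy: negative norm)."""
    return -float(np.linalg.norm(x))

class Node:
    """Search node holding a partially denoised latent at diffusion step t."""
    def __init__(self, x, t, parent=None):
        self.x, self.t, self.parent = x, t, parent
        self.children, self.visits, self.value = [], 0, 0.0

    def ucb_child(self, c=1.4):
        # Upper-confidence bound: exploit high mean reward, explore rare branches.
        return max(self.children, key=lambda n: n.value / (n.visits + 1e-9)
                   + c * math.sqrt(math.log(self.visits + 1) / (n.visits + 1e-9)))

def rollout(x, t, rng):
    """Finish denoising from (x, t) without branching and score the result."""
    while t > 0:
        x = denoise_step(x, t, rng)
        t -= 1
    return reward(x)

def mcts_sample(x_T, T, iters=200, branch=4, seed=0):
    rng = np.random.default_rng(seed)
    root = Node(x_T, T)
    for _ in range(iters):
        node = root
        # Selection: descend fully expanded nodes via UCB.
        while node.children and len(node.children) >= branch:
            node = node.ucb_child()
        # Expansion: add one alternative stochastic denoising step.
        if node.t > 0:
            node.children.append(Node(denoise_step(node.x, node.t, rng),
                                      node.t - 1, parent=node))
            node = node.children[-1]
        # Simulation and backpropagation of the rollout reward.
        r = rollout(node.x, node.t, rng)
        while node is not None:
            node.visits += 1
            node.value += r
            node = node.parent
    # Return the first denoising branch with the best mean reward.
    best = max(root.children, key=lambda n: n.value / n.visits)
    return best.x

scene = mcts_sample(np.ones(8), T=10)
```

The appeal of this kind of search is that it concentrates sampling on high-reward regions of the diffusion model's output distribution at inference time, without retraining the model itself.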
📝 Abstract
Training robots in simulation requires diverse 3D scenes that reflect the specific challenges of downstream tasks. However, scenes that satisfy strict task requirements, such as high-clutter environments with plausible spatial arrangement, are rare and costly to curate manually. Instead, we generate large-scale scene data using procedural models that approximate realistic environments for robotic manipulation, and adapt it to task-specific goals. We do this by training a unified diffusion-based generative model that predicts which objects to place from a fixed asset library, along with their SE(3) poses. This model serves as a flexible scene prior that can be adapted using reinforcement learning-based post-training, conditional generation, or inference-time search, steering generation toward downstream objectives even when they differ from the original data distribution. Our method enables goal-directed scene synthesis that respects physical feasibility and scales across scene types. We introduce a novel MCTS-based inference-time search strategy for diffusion models, enforce feasibility via projection and simulation, and release a dataset of over 44 million SE(3) scenes spanning five diverse environments. Website with videos, code, data, and model weights: https://steerable-scene-generation.github.io/
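For intuition, enforcing feasibility via projection can be approximated as iteratively resolving collisions and workspace violations on the predicted object placements before simulation validation. The toy sketch below uses bounding spheres and an axis-aligned workspace as simplifying assumptions; the paper's actual projection operates on full SE(3) poses and asset geometry.

```python
import numpy as np

def project_feasible(positions, radii, bounds, iters=50):
    """Toy projection: separate overlapping bounding spheres and clamp object
    centers into an axis-aligned workspace (a stand-in for SE(3) projection)."""
    pos = positions.astype(float).copy()
    lo, hi = bounds
    for _ in range(iters):
        # Push apart each overlapping pair along the line between centers.
        for i in range(len(pos)):
            for j in range(i + 1, len(pos)):
                d = pos[j] - pos[i]
                dist = np.linalg.norm(d) + 1e-9
                overlap = radii[i] + radii[j] - dist
                if overlap > 0:
                    push = 0.5 * overlap * d / dist
                    pos[i] -= push
                    pos[j] += push
        # Keep every object fully inside the workspace bounds.
        pos = np.clip(pos, lo + radii[:, None], hi - radii[:, None])
    return pos

# Example: three tabletop objects squeezed into a 1 m x 1 m x 0.3 m workspace.
pts = project_feasible(np.array([[0.10, 0.10, 0.05],
                                 [0.12, 0.10, 0.08],
                                 [0.50, 0.50, 0.06]]),
                       radii=np.array([0.05, 0.08, 0.06]),
                       bounds=(np.array([0.0, 0.0, 0.0]),
                               np.array([1.0, 1.0, 0.3])))
```

Scenes corrected this way would then be handed to a physics simulator for final validation, so only layouts that remain stable under simulation are kept.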