Steerable Scene Generation with Post Training and Inference-Time Search

📅 2025-05-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the scarcity and high manual construction cost of task-specific 3D scenes for training robots in simulation, this paper proposes a physics-aware scene generation framework based on diffusion models. Methodologically, it (1) unifies object selection from a fixed asset library and SE(3) pose prediction within a single generative model; (2) introduces a novel MCTS-based, goal-directed inference-time search strategy for diffusion models; and (3) enforces geometric and physical feasibility via projection-based correction and physics-simulation validation. Contributions include: (i) an open-source dataset of over 44 million SE(3) scenes spanning five diverse environment types; (ii) support for conditional generation and reinforcement-learning post-training, steering generation toward downstream objectives even when they differ from the original data distribution; and (iii) significant improvements in clutter density, layout plausibility, and task alignment. All code, models, and datasets are publicly released.
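The feasibility-enforcement step described above (projection-based correction before simulation validation) can be sketched in miniature. The sphere-based collision model, the function name `project_feasible`, and the iterative push-apart scheme below are all illustrative assumptions, not the paper's actual projection operator:

```python
import numpy as np

def project_feasible(positions, radii, floor_z=0.0, iters=50):
    """Toy projection step: treat each object as a sphere and iteratively
    resolve floor penetration and pairwise overlaps.
    (Stand-in for the paper's projection; all details are assumptions.)"""
    pos = np.asarray(positions, dtype=float).copy()
    radii = np.asarray(radii, dtype=float)
    for _ in range(iters):
        moved = False
        # keep every object resting at or above the floor plane
        for i in range(len(pos)):
            min_z = floor_z + radii[i]
            if pos[i, 2] < min_z:
                pos[i, 2] = min_z
                moved = True
        # push overlapping pairs apart symmetrically along their axis
        for i in range(len(pos)):
            for j in range(i + 1, len(pos)):
                d = pos[j] - pos[i]
                dist = np.linalg.norm(d)
                overlap = radii[i] + radii[j] - dist
                if overlap > 1e-9:
                    direction = d / dist if dist > 1e-9 else np.array([1.0, 0.0, 0.0])
                    pos[i] -= 0.5 * overlap * direction
                    pos[j] += 0.5 * overlap * direction
                    moved = True
        if not moved:
            break
    return pos
```

In the paper's pipeline a projection like this would run on the model's sampled poses, with a physics simulator providing the final validity check.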

📝 Abstract
Training robots in simulation requires diverse 3D scenes that reflect the specific challenges of downstream tasks. However, scenes that satisfy strict task requirements, such as high-clutter environments with plausible spatial arrangement, are rare and costly to curate manually. Instead, we generate large-scale scene data using procedural models that approximate realistic environments for robotic manipulation, and adapt it to task-specific goals. We do this by training a unified diffusion-based generative model that predicts which objects to place from a fixed asset library, along with their SE(3) poses. This model serves as a flexible scene prior that can be adapted using reinforcement learning-based post training, conditional generation, or inference-time search, steering generation toward downstream objectives even when they differ from the original data distribution. Our method enables goal-directed scene synthesis that respects physical feasibility and scales across scene types. We introduce a novel MCTS-based inference-time search strategy for diffusion models, enforce feasibility via projection and simulation, and release a dataset of over 44 million SE(3) scenes spanning five diverse environments. Website with videos, code, data, and model weights: https://steerable-scene-generation.github.io/
Problem

Research questions and friction points this paper is trying to address.

Generating diverse 3D scenes for robot training simulations
Adapting procedural models to task-specific robotic manipulation goals
Ensuring physical feasibility in goal-directed scene synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified diffusion model jointly predicts object selection and SE(3) poses
Reinforcement-learning post-training adapts the scene prior to downstream objectives
MCTS-based inference-time search steers diffusion sampling toward task goals