🤖 AI Summary
This work addresses the spatial hallucinations, such as object collisions, that commonly arise in single-pass 3D indoor scene synthesis methods due to insufficient reasoning. To mitigate this, we propose SceneReVis, a vision-based self-reflective framework that iteratively detects and resolves spatial conflicts using multimodal feedback within a multi-round “diagnose-and-act” loop. Our approach introduces a new 3D generation paradigm that combines self-reflection with multi-turn reinforcement learning, supported by SceneChain-12k, a large-scale dataset of causal construction trajectories. We employ a two-stage training strategy, supervised fine-tuning followed by agent-based reinforcement learning, that unifies visual perception, language instructions, and spatial reasoning. The method achieves state-of-the-art performance on both high-fidelity scene generation and goal-oriented tasks, while generalizing robustly to long-tail scenarios.
📝 Abstract
Current one-pass 3D scene synthesis methods often suffer from spatial hallucinations, such as collisions, due to a lack of deliberative reasoning. To bridge this gap, we introduce SceneReVis, a vision-grounded self-reflection framework that employs an iterative “diagnose-and-act” loop to explicitly detect and resolve spatial conflicts using multimodal feedback. To support this step-wise paradigm, we construct SceneChain-12k, a large-scale dataset of causal construction trajectories derived through a novel reverse-engineering pipeline. We further propose a two-stage training recipe that transitions from Supervised Fine-Tuning to Agentic Reinforcement Learning, evolving the model into an active spatial planner. Extensive experiments demonstrate that SceneReVis achieves state-of-the-art performance in high-fidelity generation and goal-oriented optimization, with robust generalization to long-tail domains.
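To make the loop concrete, below is a minimal, self-contained Python sketch of a multi-round diagnose-and-act iteration under simplifying assumptions: scenes are lists of axis-aligned boxes, diagnosis is an analytic overlap test rather than the vision-grounded multimodal feedback SceneReVis uses, and all names (`Obj`, `diagnose`, `act`, `diagnose_and_act`) are illustrative rather than the paper's API.

```python
from dataclasses import dataclass

# Minimal sketch of a multi-round "diagnose-and-act" loop.
# All names are illustrative assumptions; SceneReVis diagnoses conflicts
# from rendered views (multimodal feedback), not with this analytic test.

@dataclass
class Obj:
    name: str
    x: float   # center x (floor plane)
    y: float   # center y (floor plane)
    w: float   # footprint width along x
    d: float   # footprint depth along y

def overlaps(a: Obj, b: Obj) -> bool:
    """Axis-aligned footprint overlap test."""
    return (abs(a.x - b.x) < (a.w + b.w) / 2
            and abs(a.y - b.y) < (a.d + b.d) / 2)

def diagnose(scene: list[Obj]) -> list[tuple[Obj, Obj]]:
    """Diagnose step: enumerate colliding object pairs."""
    return [(a, b)
            for i, a in enumerate(scene)
            for b in scene[i + 1:]
            if overlaps(a, b)]

def act(a: Obj, b: Obj) -> None:
    """Act step: slide b along x just far enough to clear a."""
    gap = (a.w + b.w) / 2 - abs(a.x - b.x)   # overlap depth along x
    b.x += (gap + 1e-3) if b.x >= a.x else -(gap + 1e-3)

def diagnose_and_act(scene: list[Obj], max_rounds: int = 10) -> list[Obj]:
    """Iterate until no conflicts remain or the round budget is spent."""
    for _ in range(max_rounds):
        conflicts = diagnose(scene)
        if not conflicts:
            break                     # scene is conflict-free
        for a, b in conflicts:
            act(a, b)
    return scene

if __name__ == "__main__":
    room = [Obj("sofa", 0.0, 0.0, 2.0, 1.0),
            Obj("table", 0.5, 0.2, 1.2, 0.8)]
    for o in diagnose_and_act(room):
        print(f"{o.name}: ({o.x:.2f}, {o.y:.2f})")
```

Running the sketch separates the overlapping sofa and table within two rounds; in SceneReVis the analytic check is replaced by learned visual diagnosis and the fixed nudge by model-generated edit actions.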