SceneReVis: A Self-Reflective Vision-Grounded Framework for 3D Indoor Scene Synthesis via Multi-turn RL

📅 2026-02-10
🤖 AI Summary
This work addresses spatial hallucinations, such as object collisions, that commonly arise in existing single-pass 3D indoor scene synthesis methods because they lack deliberative reasoning. To mitigate this, the authors propose a vision-grounded self-reflective framework that iteratively detects and resolves spatial conflicts through multimodal feedback within a multi-turn "diagnose-and-act" loop. The approach combines self-reflection with multi-turn reinforcement learning for 3D generation and is supported by SceneChain-12k, a large-scale dataset of causal construction trajectories. Training proceeds in two stages, supervised fine-tuning followed by agentic reinforcement learning, which unifies visual perception, language instructions, and spatial reasoning. The method achieves state-of-the-art performance in both high-fidelity scene generation and goal-oriented tasks, and generalizes well to long-tail scenarios.

📝 Abstract
Current one-pass 3D scene synthesis methods often suffer from spatial hallucinations, such as collisions, due to a lack of deliberative reasoning. To bridge this gap, we introduce SceneReVis, a vision-grounded self-reflection framework that employs an iterative "diagnose-and-act" loop to explicitly intercept and resolve spatial conflicts using multi-modal feedback. To support this step-wise paradigm, we construct SceneChain-12k, a large-scale dataset of causal construction trajectories derived through a novel reverse engineering pipeline. We further propose a two-stage training recipe that transitions from Supervised Fine-Tuning to Agentic Reinforcement Learning, evolving the model into an active spatial planner. Extensive experiments demonstrate that SceneReVis achieves state-of-the-art performance in high-fidelity generation and goal-oriented optimization, with robust generalization to long-tail domains.
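
The iterative "diagnose-and-act" loop described in the abstract can be illustrated with a toy sketch. This is not the SceneReVis implementation: the paper's loop relies on a learned model and multi-modal feedback, whereas the sketch below assumes a purely geometric stand-in. The `Box` type, the `diagnose`, `act`, and `refine` functions, and the push-apart repair rule are all hypothetical names and choices made for illustration only.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Box:
    """Hypothetical axis-aligned footprint of one object on the floor plane."""
    name: str
    x: float  # center x
    y: float  # center y
    w: float  # extent along x
    d: float  # extent along y


def overlap(a: Box, b: Box) -> tuple[float, float]:
    """Penetration depth along x and y; both positive means the boxes collide."""
    ox = (a.w + b.w) / 2 - abs(a.x - b.x)
    oy = (a.d + b.d) / 2 - abs(a.y - b.y)
    return ox, oy


def diagnose(scene: list[Box]) -> list[tuple[Box, Box, float, float]]:
    """Diagnose step (assumed): list every colliding pair with its depths."""
    issues = []
    for a, b in combinations(scene, 2):
        ox, oy = overlap(a, b)
        if ox > 1e-9 and oy > 1e-9:
            issues.append((a, b, ox, oy))
    return issues


def act(issue: tuple[Box, Box, float, float]) -> None:
    """Act step (assumed): push the second box out along the shallower axis."""
    a, b, ox, oy = issue
    if ox <= oy:
        b.x += ox if b.x >= a.x else -ox
    else:
        b.y += oy if b.y >= a.y else -oy


def refine(scene: list[Box], max_turns: int = 10) -> int:
    """Alternate diagnose and act until the scene is clean; return acts used."""
    for turn in range(max_turns):
        issues = diagnose(scene)
        if not issues:
            return turn
        act(issues[0])  # fix one conflict per turn, then re-diagnose
    return max_turns


scene = [Box("sofa", 0.0, 0.0, 2.0, 1.0), Box("table", 0.5, 0.0, 1.0, 1.0)]
turns = refine(scene)
print(turns, diagnose(scene))
```

Fixing one conflict per turn and then re-diagnosing mirrors the step-wise idea in the abstract: each repair can introduce new conflicts, so the scene is re-checked before the next action instead of applying all fixes in a single pass.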
Problem

Research questions and friction points this paper is trying to address.

3D scene synthesis
spatial hallucinations
collision
deliberative reasoning
indoor scene generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

self-reflection
vision-grounded
multi-turn RL
spatial reasoning
3D scene synthesis
Yang Zhao
Research Professor, Zhejiang University, China
Intelligent Building, Smart Grid, Fault detection and diagnosis, Energy efficiency
Shizhao Sun
Microsoft
Meisheng Zhang
Microsoft Research Asia, Beijing, China; Peking University, Beijing, China
Yingdong Shi
Microsoft Research Asia, Beijing, China; ShanghaiTech University, Shanghai, China
Xubo Yang
Shanghai Jiao Tong University, Shanghai, China
Jiang Bian
Microsoft Research
Industry AI, RL, Reasoning, Spatial Intelligence