🤖 AI Summary
This work addresses the spatial hallucinations, such as object collisions, that commonly arise in single-pass 3D indoor scene synthesis methods due to insufficient reasoning. To mitigate this, we propose SceneReVis, a vision-based self-reflective framework that iteratively detects and resolves spatial conflicts using multimodal feedback within a multi-round “diagnose-and-act” loop. Our approach introduces a new 3D generation paradigm that combines self-reflection with multi-turn reinforcement learning, supported by SceneChain-12k, a large-scale dataset of causal construction trajectories. We employ a two-stage training strategy, supervised fine-tuning followed by agent-based reinforcement learning, that unifies visual perception, language instructions, and spatial reasoning. The method achieves state-of-the-art performance on both high-fidelity scene generation and goal-oriented tasks, while generalizing robustly to long-tail scenarios.
📝 Abstract
Current one-pass 3D scene synthesis methods often suffer from spatial hallucinations, such as collisions, due to a lack of deliberative reasoning. To bridge this gap, we introduce SceneReVis, a vision-grounded self-reflection framework that employs an iterative “diagnose-and-act” loop to explicitly detect and resolve spatial conflicts using multimodal feedback. To support this step-wise paradigm, we construct SceneChain-12k, a large-scale dataset of causal construction trajectories derived through a novel reverse-engineering pipeline. We further propose a two-stage training recipe that transitions from Supervised Fine-Tuning to Agentic Reinforcement Learning, evolving the model into an active spatial planner. Extensive experiments demonstrate that SceneReVis achieves state-of-the-art performance in high-fidelity generation and goal-oriented optimization, with robust generalization to long-tail domains.
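To make the loop concrete, below is a minimal, self-contained Python sketch of a multi-round diagnose-and-act iteration under simplifying assumptions: scenes are lists of axis-aligned boxes, diagnosis is an analytic overlap test rather than the vision-grounded multimodal feedback SceneReVis uses, and all names (`Obj`, `diagnose`, `act`, `diagnose_and_act`) are illustrative rather than the paper's API.

```python
from dataclasses import dataclass

# Minimal sketch of a multi-round "diagnose-and-act" loop.
# All names are illustrative assumptions; SceneReVis diagnoses conflicts
# from rendered views (multimodal feedback), not with this analytic test.

@dataclass
class Obj:
    name: str
    x: float   # center x (floor plane)
    y: float   # center y (floor plane)
    w: float   # footprint width along x
    d: float   # footprint depth along y

def overlaps(a: Obj, b: Obj) -> bool:
    """Axis-aligned footprint overlap test."""
    return (abs(a.x - b.x) < (a.w + b.w) / 2
            and abs(a.y - b.y) < (a.d + b.d) / 2)

def diagnose(scene: list[Obj]) -> list[tuple[Obj, Obj]]:
    """Diagnose step: enumerate colliding object pairs."""
    return [(a, b)
            for i, a in enumerate(scene)
            for b in scene[i + 1:]
            if overlaps(a, b)]

def act(a: Obj, b: Obj) -> None:
    """Act step: slide b along x just far enough to clear a."""
    gap = (a.w + b.w) / 2 - abs(a.x - b.x)   # overlap depth along x
    b.x += (gap + 1e-3) if b.x >= a.x else -(gap + 1e-3)

def diagnose_and_act(scene: list[Obj], max_rounds: int = 10) -> list[Obj]:
    """Iterate until no conflicts remain or the round budget is spent."""
    for _ in range(max_rounds):
        conflicts = diagnose(scene)
        if not conflicts:
            break                     # scene is conflict-free
        for a, b in conflicts:
            act(a, b)
    return scene

if __name__ == "__main__":
    room = [Obj("sofa", 0.0, 0.0, 2.0, 1.0),
            Obj("table", 0.5, 0.2, 1.2, 0.8)]
    for o in diagnose_and_act(room):
        print(f"{o.name}: ({o.x:.2f}, {o.y:.2f})")
```

Running the sketch separates the overlapping sofa and table within two rounds; in SceneReVis the analytic check is replaced by learned visual diagnosis and the fixed nudge by model-generated edit actions.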