Perceive-then-Plan: Layout-as-Policy for Monocular 3D Scene Layout Estimation

📅 2026-05-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the problem of generating structured 3D scene layouts from a single image while keeping them both visually consistent and physically plausible, a task that direct prediction handles poorly. The authors propose a two-stage perceive-then-plan framework using vision-language models: a geometry-enhanced Perceiver first produces an initial layout aligned with the image, and a Planner then refines it iteratively through a learned policy over discrete actions (translation, rotation, and rescaling). Layout estimation is thus cast as a preference-driven planning process. The Planner is trained with supervised trajectory initialization followed by preference-based optimization, so no explicit reward engineering is required, and the iterative formulation helps capture global constraints and complex object interactions. According to the authors, the resulting layouts are more physically coherent and better aligned with the input image than those of existing approaches, and the method naturally supports downstream tasks such as scene editing and manipulation.
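
To make the idea of refining a layout through discrete actions concrete, here is a minimal Python sketch. It is not the authors' implementation: the `Box` state, the step sizes, the `cost` function, and the greedy `refine` loop are all hypothetical stand-ins. In particular, the greedy search over a hand-written cost replaces the learned Planner policy, and the drift term is only a crude proxy for consistency with the input image.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Box:
    """Hypothetical per-object state: a 2D footprint plus a yaw angle."""
    x: float              # footprint centre along x
    y: float              # footprint centre along y
    w: float              # extent along x
    d: float              # extent along y
    yaw_deg: float = 0.0


Layout = Dict[str, Box]

# Discrete action vocabulary. Step sizes are illustrative assumptions.
ACTIONS: Dict[str, Callable[[Box], Box]] = {
    "translate_x+": lambda b: replace(b, x=b.x + 0.25),
    "translate_x-": lambda b: replace(b, x=b.x - 0.25),
    "translate_y+": lambda b: replace(b, y=b.y + 0.25),
    "translate_y-": lambda b: replace(b, y=b.y - 0.25),
    "rotate+": lambda b: replace(b, yaw_deg=(b.yaw_deg + 15.0) % 360.0),
    "rotate-": lambda b: replace(b, yaw_deg=(b.yaw_deg - 15.0) % 360.0),
    "scale_up": lambda b: replace(b, w=b.w * 1.1, d=b.d * 1.1),
    "scale_down": lambda b: replace(b, w=b.w / 1.1, d=b.d / 1.1),
}


def step(layout: Layout, obj: str, action: str) -> Layout:
    """Apply one discrete action to one object and return a new layout."""
    new_layout = dict(layout)
    new_layout[obj] = ACTIONS[action](layout[obj])
    return new_layout


def overlap_area(a: Box, b: Box) -> float:
    """Overlap of two footprints, ignoring yaw to keep the sketch short."""
    ox = min(a.x + a.w / 2, b.x + b.w / 2) - max(a.x - a.w / 2, b.x - b.w / 2)
    oy = min(a.y + a.d / 2, b.y + b.d / 2) - max(a.y - a.d / 2, b.y - b.d / 2)
    return max(ox, 0.0) * max(oy, 0.0)


def cost(layout: Layout, initial: Layout, drift_weight: float = 0.1) -> float:
    """Interpenetration plus a penalty for drifting from the initial layout."""
    names = sorted(layout)
    overlap = sum(
        overlap_area(layout[p], layout[q])
        for i, p in enumerate(names)
        for q in names[i + 1:]
    )
    drift = sum(
        abs(layout[n].x - initial[n].x) + abs(layout[n].y - initial[n].y)
        + abs(layout[n].w - initial[n].w) + abs(layout[n].d - initial[n].d)
        for n in names
    )
    return overlap + drift_weight * drift


def refine(initial: Layout, max_steps: int = 20) -> Tuple[Layout, List[Tuple[str, str]]]:
    """Greedy stand-in for a learned policy: take the best action each step."""
    layout = dict(initial)
    trajectory: List[Tuple[str, str]] = []
    for _ in range(max_steps):
        best_cost, obj, action = min(
            (cost(step(layout, o, a), initial), o, a)
            for o in layout
            for a in ACTIONS
        )
        if best_cost >= cost(layout, initial) - 1e-9:
            break  # no action strictly improves the layout
        layout = step(layout, obj, action)
        trajectory.append((obj, action))
    return layout, trajectory
```

Calling `refine` on a layout with two interpenetrating boxes returns the refined layout together with the action sequence that produced it. Because this toy cost ignores yaw, the rotation actions never change it and are never selected; a learned policy conditioned on the image would not have that limitation.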

📝 Abstract

Building structured 3D scene layouts from a single image requires reconciling visual observations with physical and spatial constraints, a challenge that is difficult to address with direct prediction alone. In this work, we formulate monocular 3D layout estimation as a perceive-then-plan problem with vision-language models, where a Perceiver first grounds the 3D objects and then a Planner iteratively refines the scene hypothesis through actions that improve physical plausibility while preserving consistency with the input image. We propose Layout-as-Policy (LaP), which casts the planning stage as a policy learning problem: 3D layouts are represented as structured states, and refined via discrete actions such as translation, rotation, and rescaling. Starting from an observation-aligned initialization with the geometry-enhanced Perceiver, the LaP Planner is trained to produce action sequences that progressively resolve geometric inconsistencies and enforce realistic spatial relations. To enable effective learning, we combine supervised trajectory initialization with preference-based optimization, allowing the model to learn corrective behaviors without requiring explicit reward engineering. This formulation transforms layout estimation from a one-shot prediction task into an iterative refinement process, enabling better handling of global constraints and complex object interactions. Experiments demonstrate that our approach produces layouts that are more physically coherent and better aligned with visual observations, while naturally supporting downstream tasks such as scene editing and manipulation.
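
The abstract does not say which objectives implement the two training stages. Purely as an illustration, the block below writes out one common pairing: a supervised next-action loss over demonstration trajectories, followed by a Direct Preference Optimization (DPO) style loss over pairs of action sequences. All symbols are assumptions introduced here, not notation from the paper: $\pi_\theta$ is the Planner policy, $\pi_{\mathrm{ref}}$ is the frozen policy obtained after the supervised stage, $s$ is the input image together with the initial layout, $\tau^{+}$ and $\tau^{-}$ are the preferred and dispreferred action sequences, $\sigma$ is the logistic function, and $\beta$ is a temperature.

```latex
% Stage 1 (assumed form): supervised trajectory initialization
\mathcal{L}_{\mathrm{SFT}}(\theta)
  = -\,\mathbb{E}_{(s_t,\,a_t)\sim\mathcal{D}_{\mathrm{demo}}}
    \bigl[\log \pi_\theta(a_t \mid s_t)\bigr]

% Stage 2 (assumed form): DPO-style preference optimization
\mathcal{L}_{\mathrm{pref}}(\theta)
  = -\,\mathbb{E}_{(s,\,\tau^{+},\,\tau^{-})\sim\mathcal{D}_{\mathrm{pref}}}
    \left[\log \sigma\!\left(
      \beta \log \frac{\pi_\theta(\tau^{+}\mid s)}{\pi_{\mathrm{ref}}(\tau^{+}\mid s)}
      - \beta \log \frac{\pi_\theta(\tau^{-}\mid s)}{\pi_{\mathrm{ref}}(\tau^{-}\mid s)}
    \right)\right]
```

An objective of this kind needs only rankings between refinement trajectories, which is consistent with the abstract's claim that corrective behaviors are learned without explicit reward engineering.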

Problem

Research questions and friction points this paper is trying to address.

monocular 3D scene layout estimation

physical plausibility

spatial constraints

structured 3D layouts

visual observations

Innovation

Methods, ideas, or system contributions that make the work stand out.

Layout-as-Policy

perceive-then-plan

vision-language models