Edit-As-Act: Goal-Regressive Planning for Open-Vocabulary 3D Indoor Scene Editing

📅 2026-03-18
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing open-vocabulary 3D indoor scene editing methods often suffer from structural degradation, physical inconsistencies, and unintended global alterations due to full-scene regeneration or image-space manipulations. This work proposes the first formulation of 3D scene editing as a goal-directed symbolic planning task, introducing EditLang, a novel action language that explicitly encodes preconditions, effects, and geometric relationships such as support and collision. By integrating a language-driven planner with a multi-constraint verification mechanism, the approach decouples high-level reasoning from low-level generation. Evaluated on E2A-Bench, a benchmark comprising 63 editing tasks, the method significantly outperforms existing approaches, achieving high-fidelity, physically plausible, and interpretable edits while preserving the unmodified portions of the original scene.

πŸ“ Abstract
Editing a 3D indoor scene from natural language is conceptually straightforward but technically challenging. Existing open-vocabulary systems often regenerate large portions of a scene or rely on image-space edits that disrupt spatial structure, resulting in unintended global changes or physically inconsistent layouts. These limitations stem from treating editing primarily as a generative task. We take a different view. A user instruction defines a desired world state, and editing should be the minimal sequence of actions that makes this state true while preserving everything else. This perspective motivates Edit-As-Act, a framework that performs open-vocabulary scene editing as goal-regressive planning in 3D space. Given a source scene and free-form instruction, Edit-As-Act predicts symbolic goal predicates and plans in EditLang, a PDDL-inspired action language that we design with explicit preconditions and effects encoding support, contact, collision, and other geometric relations. A language-driven planner proposes actions, and a validator enforces goal-directedness, monotonicity, and physical feasibility, producing interpretable and physically coherent transformations. By separating reasoning from low-level generation, Edit-As-Act achieves instruction fidelity, semantic consistency, and physical plausibility: three criteria that existing paradigms cannot satisfy together. On E2A-Bench, our benchmark of 63 editing tasks across 9 indoor environments, Edit-As-Act significantly outperforms prior approaches across all edit types and scene categories.
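The abstract's core idea (regress a symbolic goal through actions with explicit preconditions and effects, rejecting actions that undo part of the goal) can be sketched in a few lines. This is a minimal illustrative sketch of classical goal regression, not the paper's actual EditLang or planner implementation; all names (`Action`, `regress`, `plan`, the example predicates) are assumptions made here for illustration.

```python
# Illustrative sketch of goal-regressive planning over symbolic edit
# actions, loosely inspired by the EditLang description above.
# This is NOT the paper's implementation; all names are hypothetical.
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    preconditions: frozenset  # predicates that must hold beforehand
    add_effects: frozenset    # predicates the action makes true
    del_effects: frozenset    # predicates the action makes false


def regress(goal, action):
    """Regress a goal through an action: the subgoal that must hold
    before the action so that the goal holds after it. Returns None
    if the action deletes part of the goal (not goal-directed)."""
    if goal & action.del_effects:
        return None
    return (goal - action.add_effects) | action.preconditions


def plan(goal, state, actions, depth=4):
    """Backward (goal-regressive) search from goal to current state."""
    if goal <= state:
        return []  # goal already satisfied by the scene
    if depth == 0:
        return None
    for a in actions:
        if not (goal & a.add_effects):  # only consider relevant actions
            continue
        subgoal = regress(goal, a)
        if subgoal is None:
            continue
        prefix = plan(subgoal, state, actions, depth - 1)
        if prefix is not None:
            return prefix + [a.name]
    return None


# Tiny example: move a lamp from the floor onto a clear desk.
state = frozenset({("on", "lamp", "floor"), ("clear", "desk")})
goal = frozenset({("on", "lamp", "desk")})
move = Action(
    name="move lamp onto desk",
    preconditions=frozenset({("clear", "desk")}),
    add_effects=frozenset({("on", "lamp", "desk")}),
    del_effects=frozenset({("on", "lamp", "floor")}),
)
result = plan(goal, state, [move])
```

The `regress` rejection branch loosely mirrors the validator's goal-directedness check described in the abstract: an action whose delete effects overlap the remaining goal can never be part of a minimal, monotone plan. Physical feasibility checks (support, contact, collision) would be additional filters on each candidate action and are omitted here.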
Problem

Research questions and friction points this paper is trying to address.

open-vocabulary editing
3D indoor scene editing
physical plausibility
semantic consistency
instruction fidelity
Innovation

Methods, ideas, or system contributions that make the work stand out.

goal-regressive planning
open-vocabulary editing
symbolic action language
physical plausibility
3D scene manipulation
🔎 Similar Papers
No similar papers found.