🤖 AI Summary
To address the challenge of precisely correcting local errors in complex 3D scene layouts, this paper proposes a human-in-the-loop local correction method. Users identify erroneous regions from a first-person (egocentric) perspective, and the system repairs them automatically and in a semantically consistent way, using structured language representations and an infilling mechanism inspired by NLP. The work adapts the infilling paradigm to local 3D layout refinement, so that iterative refinement can produce layouts that diverge from the training distribution, and it establishes a low-friction “one-click repair” interaction protocol. Key innovations include a multi-task SceneScript model and a language-driven local-global co-optimization framework. Experiments demonstrate that the method preserves global layout prediction accuracy while significantly improving local correction fidelity, enabling high-fidelity, user-controllable reconstruction of real-world scenes.
📝 Abstract
We present a novel human-in-the-loop approach to estimating 3D scene layout that uses human feedback from an egocentric standpoint. We study this approach by introducing a novel local correction task, in which users identify local errors and prompt a model to correct them automatically. Building on SceneScript, a state-of-the-art framework for 3D scene layout estimation that leverages structured language, we propose a solution that structures this problem as "infilling", a task studied in natural language processing. We train a multi-task version of SceneScript that maintains performance on global predictions while significantly improving its local correction ability. We integrate this model into a human-in-the-loop system, enabling a user to iteratively refine scene layout estimates via a low-friction "one-click fix" workflow. Our system allows the final refined layout to diverge from the training distribution, which permits more accurate modelling of complex layouts.
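To make the infilling framing concrete, the sketch below shows one way a local correction could be posed over a structured-language layout: the command a user flags is replaced by a sentinel token, and a prediction for that slot is substituted back in. This is only an illustration under stated assumptions, not the paper's implementation. The `Command` class, the `<mask>` and `<sep>` tokens, the command names and parameters, and both helper functions are hypothetical, and the model call itself is omitted.

```python
from dataclasses import dataclass

MASK = "<mask>"  # hypothetical sentinel marking the slot to be corrected
SEP = "<sep>"    # hypothetical separator after which a model would write the fill


@dataclass(frozen=True)
class Command:
    """One structured-language layout command (names and params are illustrative)."""
    name: str
    params: tuple

    def to_text(self) -> str:
        return self.name + "(" + ", ".join(str(p) for p in self.params) + ")"


def make_infilling_prompt(commands, bad_index):
    """Replace the flagged command with MASK and append SEP."""
    if not 0 <= bad_index < len(commands):
        raise IndexError("bad_index out of range")
    prompt = [MASK if i == bad_index else c.to_text()
              for i, c in enumerate(commands)]
    prompt.append(SEP)
    return prompt


def apply_infill(prompt, predicted_text):
    """Put a predicted command back into the masked slot and drop SEP."""
    return [predicted_text if tok == MASK else tok
            for tok in prompt if tok != SEP]


# Usage: the user flags the door (index 1); a model would predict the fill.
layout = [
    Command("make_wall", (0.0, 0.0, 4.0, 0.0)),
    Command("make_door", (1.0, 0.0, 0.9)),
    Command("make_window", (2.5, 0.0, 1.2)),
]
prompt = make_infilling_prompt(layout, 1)
corrected = apply_infill(prompt, "make_door(1.2, 0.0, 0.9)")  # stand-in prediction
```

Because only the flagged slot is regenerated, the rest of the layout is left untouched, which is what lets a user repeat the step once per error in an iterative workflow.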