🤖 AI Summary
This work addresses the challenges of single-image structured 3D reconstruction in scenes with heavy occlusion and clutter, where conventional approaches that rely on semantic segmentation and depth estimation perform poorly. The authors propose an iterative object removal and reconstruction framework that progressively detects, segments, and removes foreground objects and fits their 3D models, thereby decomposing a complex scene into a sequence of simpler subtasks. A key idea is to employ a vision-language model (VLM) as a coordinator that orchestrates instance segmentation, image inpainting, and 3D object detection modules without task-specific training, yielding an end-to-end, fine-tuning-free reconstruction pipeline. Experiments on 3D-FRONT and ADE20K show that the method substantially outperforms existing techniques, with particularly strong robustness and reconstruction accuracy in highly occluded scenes.
📝 Abstract
We present SeeingThroughClutter, a method for reconstructing structured 3D representations from single images by segmenting and modeling objects individually. Prior approaches rely on intermediate tasks such as semantic segmentation and depth estimation, which often underperform in complex scenes, particularly in the presence of occlusion and clutter. We address this by introducing an iterative object removal and reconstruction pipeline that decomposes complex scenes into a sequence of simpler subtasks. Using VLMs as orchestrators, we remove foreground objects one at a time via detection, segmentation, object removal, and 3D fitting. We show that removing objects allows cleaner segmentation of subsequent objects, even in highly occluded scenes. Our method requires no task-specific training and benefits directly from ongoing advances in foundation models. We demonstrate state-of-the-art robustness on the 3D-FRONT and ADE20K datasets. Project Page: https://rioak.github.io/seeingthroughclutter/
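The iterative loop described above can be sketched as follows. This is a minimal toy illustration, not the paper's actual implementation: here an "image" is just a list of object labels, and the helper functions (`detect_foreground`, `remove_object`, `fit_3d`) are hypothetical stand-ins for the VLM-coordinated detection, inpainting, and 3D-fitting modules.

```python
# Toy sketch of the iterative object removal and reconstruction loop.
# All helpers below are hypothetical placeholders for the real modules
# (instance segmentation, image inpainting, 3D object fitting).

def detect_foreground(image):
    # Toy detector: treat the last object in the list as frontmost.
    return image[-1] if image else None

def remove_object(image, obj):
    # Toy "inpainter": drop the detected object from the scene.
    return [o for o in image if o != obj]

def fit_3d(obj):
    # Toy 3D fitting: just record the object's label.
    return {"label": obj}

def reconstruct_scene(image):
    """Iteratively detect, fit, and remove foreground objects
    until no foreground object remains."""
    models = []
    while True:
        obj = detect_foreground(image)
        if obj is None:                 # scene fully decomposed: stop
            break
        models.append(fit_3d(obj))      # fit a 3D model for this object
        image = remove_object(image, obj)  # removal exposes occluded objects
    return models

print(reconstruct_scene(["sofa", "table", "lamp"]))
# → [{'label': 'lamp'}, {'label': 'table'}, {'label': 'sofa'}]
```

The key property this loop captures is that each removal simplifies the remaining scene, so later objects are segmented against a progressively less cluttered background.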