AI Summary
To address geometric distortion and inaccurate pose estimation caused by severe occlusion in open-set scenarios, this paper proposes a decoupled 3D scene generation framework that separates occlusion removal from 3D object generation to mitigate their mutual interference. We design a global-local attention mechanism integrating self-attention and cross-attention to enhance pose robustness, and introduce OpenScene3D, the first large-scale synthetic dataset tailored for open-set 3D scene composition. Furthermore, we propose a multi-scale unified pose estimation model. Our method is jointly trained on RGB images and dedicated de-occlusion supervision signals, achieving significant improvements in geometric completeness and pose accuracy across diverse indoor and outdoor open-set scenes, consistently outperforming state-of-the-art methods. The code and the OpenScene3D dataset are publicly released.
Abstract
In this work, we propose SceneMaker, a decoupled 3D scene generation framework. Due to the lack of sufficient open-set de-occlusion and pose estimation priors, existing methods struggle to simultaneously produce high-quality geometry and accurate poses under severe occlusion in open-set settings. To address these issues, we first decouple the de-occlusion model from 3D object generation and enhance it with image datasets and collected de-occlusion datasets covering much more diverse open-set occlusion patterns. We then propose a unified pose estimation model that integrates global and local mechanisms into both self-attention and cross-attention to improve accuracy. In addition, we construct an open-set 3D scene dataset to further extend the generalization of the pose estimation model. Comprehensive experiments demonstrate the superiority of our decoupled framework on both indoor and open-set scenes. Our code and datasets are released at https://idea-research.github.io/SceneMaker/.
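The pose estimation model described above combines global and local mechanisms in both self-attention and cross-attention. The abstract does not specify the exact design, so the following is only a minimal numpy sketch under our own assumptions: object tokens attend to each other (self-attention) and to scene features (cross-attention), each in a global variant and a windowed local variant, fused by residual addition. The branch structure, the band-shaped locality mask, and the fusion are illustrative, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    # Scaled dot-product attention; mask is boolean with
    # True = "query may attend to this key".
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)
    return softmax(scores) @ v

def band_mask(n_q, n_k, width):
    # Hypothetical locality: each query attends only to keys near its
    # (rescaled) position, forming a diagonal band of the given width.
    centers = (np.arange(n_q)[:, None] * n_k) // n_q
    keys = np.arange(n_k)[None, :]
    return np.abs(centers - keys) <= width

def global_local_block(obj_tokens, scene_tokens, width=4):
    """Illustrative global-local attention block (our assumption,
    not SceneMaker's actual architecture)."""
    n_o, n_s = len(obj_tokens), len(scene_tokens)
    gs = attention(obj_tokens, obj_tokens, obj_tokens)               # global self-attn
    ls = attention(obj_tokens, obj_tokens, obj_tokens,
                   band_mask(n_o, n_o, 1))                           # local self-attn
    gc = attention(obj_tokens, scene_tokens, scene_tokens)           # global cross-attn
    lc = attention(obj_tokens, scene_tokens, scene_tokens,
                   band_mask(n_o, n_s, width))                       # local cross-attn
    return obj_tokens + gs + ls + gc + lc                            # residual fusion

# Example: 5 object-pose queries attending over 40 scene feature tokens.
rng = np.random.default_rng(0)
obj = rng.normal(size=(5, 16))
scene = rng.normal(size=(40, 16))
out = global_local_block(obj, scene)   # shape (5, 16)
```

In a real model each branch would have learned projections and multiple heads; the point here is only how global and local variants of the two attention types can be combined on the same set of object tokens.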