🤖 AI Summary
Existing methods for 3D scene graph generation are hindered by scarce annotated data and susceptibility to object priors, making it challenging to design effective self-supervised pretraining tasks. This work proposes a topological layout learning framework that, for the first time, formulates predicate-aware topological layout reconstruction as a self-supervised objective. By modeling spatial priors conditioned on anchor points and leveraging graph neural networks for topological-geometric reasoning, the approach recovers the global structure of subgraphs. To preserve semantic fidelity, the method incorporates structure-aware multi-view augmentation and enhances relational representations through self-distillation. Evaluated on the 3DSSG dataset, the proposed framework significantly outperforms current state-of-the-art baselines, demonstrating its effectiveness and robustness.
📝 Abstract
3D Scene Graph (3DSG) generation plays a pivotal role in spatial understanding and semantic-affordance perception. However, its generalizability is often constrained by data scarcity. Current solutions primarily focus on cross-modal assisted representation learning and object-centric generation pretraining. The former relies heavily on predicate annotations, while the latter may bypass predicate learning because of strong object priors. Consequently, they often fail to provide a label-free and robust self-supervised proxy task for 3DSG fine-tuning. To bridge this gap, we propose Topological Layout Learning (ToLL), a pretraining framework for 3DSG. Specifically, we design Anchor-Conditioned Topological Geometry Reasoning, in which a GNN recovers the global layout of zero-centered subgraphs from the spatial priors of sparse anchors. This process is strictly modulated by predicate features, thereby enforcing predicate relation learning. Furthermore, we construct a Structural Multi-view Augmentation to avoid semantic corruption and enhance representations via self-distillation. Extensive experiments on the 3DSSG dataset demonstrate that ToLL improves representation quality and outperforms state-of-the-art baselines.
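To make the proxy task described above more concrete, the following is a minimal sketch of how one layout-reconstruction sample might be set up. It is not the authors' implementation: the function names `make_layout_task` and `layout_loss`, the choice to zero out hidden positions, and the mean-squared-error objective are all assumptions made for illustration. The GNN and the predicate-feature modulation are deliberately omitted.

```python
import random


def make_layout_task(centers, num_anchors, seed=0):
    """Build one hypothetical layout-reconstruction sample from object centers."""
    n, dim = len(centers), len(centers[0])
    # Zero-center the subgraph so the target layout is translation invariant.
    mean = [sum(c[d] for c in centers) / n for d in range(dim)]
    target = [[c[d] - mean[d] for d in range(dim)] for c in centers]
    # Sparse anchors: a few nodes whose positions stay visible (assumed uniform sampling).
    anchors = set(random.Random(seed).sample(range(n), num_anchors))
    # Positions of non-anchor nodes are hidden (zeroed) in the model input.
    visible = [row[:] if i in anchors else [0.0] * dim
               for i, row in enumerate(target)]
    return target, visible, anchors


def layout_loss(pred, target, anchors):
    """Mean squared error over the hidden (non-anchor) nodes only."""
    errs = [(p - t) ** 2
            for i, (prow, trow) in enumerate(zip(pred, target))
            if i not in anchors
            for p, t in zip(prow, trow)]
    return sum(errs) / len(errs) if errs else 0.0


# Toy subgraph with four object centers; two of them act as anchors.
centers = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [2.0, 2.0, 2.0]]
target, visible, anchors = make_layout_task(centers, num_anchors=2)
print(layout_loss(target, target, anchors))   # a perfect prediction
print(layout_loss(visible, target, anchors))  # predicting zeros for hidden nodes
```

In a full pipeline, a predicate-conditioned GNN would take `visible` together with node and edge features and produce `pred`. Restricting the loss to non-anchor nodes reflects the idea that the anchors only supply spatial priors, while the remaining layout must be inferred through the relations.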