FAST3DIS: Feed-forward Anchored Scene Transformer for 3D Instance Segmentation

πŸ“… 2026-03-26
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work proposes an end-to-end feed-forward anchored scene Transformer architecture to address the limitations of existing 3D instance segmentation methods, which predominantly rely on non-end-to-end “lift-and-cluster” paradigms that decouple representation learning from segmentation objectives and hinder scalability. The proposed approach introduces learnable 3D anchor generation and an anchor-sampling cross-attention mechanism to achieve multi-view consistent instance segmentation without post-hoc clustering. To mitigate query conflicts and enhance boundary precision, it incorporates a dual-level regularization that couples multi-view contrastive learning with a dynamically scheduled spatial overlap penalty. Evaluated on complex indoor scene datasets, the method achieves segmentation accuracy competitive with clustering-based baselines while significantly improving memory efficiency and inference speed.
πŸ“ Abstract
While recent feed-forward 3D reconstruction models provide a strong geometric foundation for scene understanding, extending them to 3D instance segmentation typically relies on a disjointed "lift-and-cluster" paradigm. Grouping dense pixel-wise embeddings via non-differentiable clustering scales poorly with the number of views and disconnects representation learning from the final segmentation objective. In this paper, we present a Feed-forward Anchored Scene Transformer for 3D Instance Segmentation (FAST3DIS), an end-to-end approach that bypasses post-hoc clustering. We introduce a 3D-anchored, query-based Transformer architecture built upon a foundational depth backbone, adapted efficiently to learn instance-specific semantics while retaining its zero-shot geometric priors. We formulate a learned 3D anchor generator coupled with an anchor-sampling cross-attention mechanism for view-consistent 3D instance segmentation. By projecting 3D object queries directly into multi-view feature maps, our method samples context efficiently. Furthermore, we introduce a dual-level regularization strategy that couples multi-view contrastive learning with a dynamically scheduled spatial overlap penalty to explicitly prevent query collisions and ensure precise instance boundaries. Experiments on complex indoor 3D datasets demonstrate that our approach achieves competitive segmentation accuracy with significantly improved memory scalability and inference speed over state-of-the-art clustering-based methods.
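The core efficiency claim of the abstract is that 3D object queries are projected directly into multi-view feature maps and sample context only at their projected locations, rather than attending densely over all pixels. The paper's actual architecture is not public here, so the following is a minimal numpy sketch of that sampling step under standard assumptions: a pinhole projection with intrinsics `K` and world-to-camera extrinsics `T_cw`, bilinear sampling of per-view feature maps at the projected anchor locations, and a simple average over views in place of the learned cross-attention weighting. All function names (`project_points`, `bilinear_sample`, `anchor_sample`) are hypothetical.

```python
import numpy as np

def project_points(points_w, K, T_cw):
    """Project Nx3 world points into one view via a pinhole camera model.
    K: 3x3 intrinsics; T_cw: 4x4 world-to-camera extrinsics. Returns Nx2 (u, v)."""
    homog = np.hstack([points_w, np.ones((points_w.shape[0], 1))])  # Nx4
    cam = (T_cw @ homog.T).T[:, :3]                                 # camera coords
    uv = (K @ cam.T).T                                              # Nx3
    return uv[:, :2] / uv[:, 2:3]                                   # perspective divide

def bilinear_sample(feat, uv):
    """Bilinearly sample an HxWxC feature map at Nx2 (u, v) pixel coords."""
    H, W, _ = feat.shape
    u = np.clip(uv[:, 0], 0.0, W - 1 - 1e-6)
    v = np.clip(uv[:, 1], 0.0, H - 1 - 1e-6)
    u0, v0 = np.floor(u).astype(int), np.floor(v).astype(int)
    du, dv = (u - u0)[:, None], (v - v0)[:, None]
    top = feat[v0, u0] * (1 - du) + feat[v0, u0 + 1] * du
    bot = feat[v0 + 1, u0] * (1 - du) + feat[v0 + 1, u0 + 1] * du
    return top * (1 - dv) + bot * dv                                # NxC

def anchor_sample(anchors_w, feats, Ks, T_cws):
    """Gather multi-view context for each 3D anchor: project the anchors into
    every view, sample features there, and average across views (a stand-in
    for the learned attention aggregation)."""
    per_view = [bilinear_sample(f, project_points(anchors_w, K, T))
                for f, K, T in zip(feats, Ks, T_cws)]
    return np.mean(per_view, axis=0)                                # num_anchors x C
```

The point of the sketch: per anchor, the cost is O(num_views) feature lookups instead of attention over every pixel of every view, which is where the claimed memory scalability in the number of views comes from.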
Problem

Research questions and friction points this paper is trying to address.

3D instance segmentation
lift-and-cluster paradigm
non-differentiable clustering
multi-view scalability
representation learning disconnection
Innovation

Methods, ideas, or system contributions that make the work stand out.

3D instance segmentation
feed-forward transformer
3D anchors
query-based attention
multi-view contrastive learning
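The dual-level regularization named above (and in the abstract) couples a multi-view contrastive term with a dynamically scheduled spatial overlap penalty. The paper's exact losses are not given here, so this is a hedged numpy sketch under common choices: an InfoNCE loss over row-aligned instance embeddings from two views, a mean pairwise overlap of per-query soft masks, and a linear weight ramp as a simple stand-in for the dynamic schedule. All names and the ramp schedule are assumptions.

```python
import numpy as np

def info_nce(emb_a, emb_b, tau=0.07):
    """InfoNCE between row-aligned embeddings of the same instances seen from
    two views (row i of emb_a matches row i of emb_b)."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    logits = a @ b.T / tau                                   # NxN similarities
    logits -= logits.max(axis=1, keepdims=True)              # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                      # positives on diagonal

def overlap_penalty(masks):
    """Mean pairwise overlap of per-query soft masks (QxP, values in [0, 1]);
    pushes different queries to claim disjoint spatial regions."""
    gram = masks @ masks.T                                   # QxQ overlap matrix
    off_diag = gram - np.diag(np.diag(gram))
    Q, P = masks.shape
    return off_diag.sum() / (Q * (Q - 1) * P)

def dual_level_loss(emb_a, emb_b, masks, step, ramp_steps=1000, w_max=1.0):
    """Contrastive term plus an overlap penalty whose weight ramps up linearly
    over training (a simple stand-in for the paper's dynamic scheduling)."""
    w = w_max * min(1.0, step / ramp_steps)
    return info_nce(emb_a, emb_b) + w * overlap_penalty(masks)
```

Intuition for the scheduling choice sketched here: early in training the queries have not yet specialized, so penalizing overlap immediately can collapse them; ramping the penalty in lets the contrastive term first make query embeddings view-consistent, then sharpens instance boundaries.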
πŸ”Ž Similar Papers
No similar papers found.