🤖 AI Summary
Point transformers face a trade-off: enlarging the receptive field (RF) often dilutes group-wise attention, impairing fine-grained feature modeling. Existing proxy-based approaches suffer from critical limitations: global proxies induce O(N²) complexity and positional ambiguity, while local proxies lack geometric robustness, interact inefficiently across patches, and fail to balance local-global fusion. To address these issues, the paper proposes SP$^2$T, a dual-stream sparse proxy point transformer. SP$^2$T introduces spatially adaptive, vertex-associated local proxy sampling to preserve geometric consistency, and a sparse proxy attention mechanism coupled with lookup-table-based relative positional biases for efficient and precise proxy-point interaction. Evaluated on semantic and instance segmentation benchmarks, SP$^2$T achieves state-of-the-art performance and demonstrates robust multi-scale point cloud processing. The code is publicly available.
📝 Abstract
In 3D understanding, point transformers have driven significant advances by broadening the receptive field. However, further enlarging the receptive field is hindered by the constraints of grouping attention. Proxy-based models, a hot topic in image and language feature extraction, use global or local proxies to expand the model's receptive field. But global proxy-based methods cannot precisely determine proxy positions and are ill-suited to point-cloud tasks such as segmentation and detection, while existing local proxy-based methods for images struggle with global-local balance, proxy sampling across diverse point clouds, and parallel cross-attention computation for sparse associations. In this paper, we present SP$^2$T, a local proxy-based dual-stream point transformer, which enlarges the global receptive field while maintaining a balance between local and global information. To achieve robust 3D proxy sampling, we propose spatial-wise proxy sampling with vertex-based point-proxy associations, ensuring robust sampling across point clouds of many scales. To achieve economical association computation, we introduce sparse proxy attention combined with a table-based relative bias, which enables low-cost and precise interactions between proxy and point features. Comprehensive experiments across multiple datasets show that our model achieves SOTA performance on downstream tasks. The code has been released at https://github.com/TerenceWallel/Sparse-Proxy-Point-Transformer .
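To make the core idea concrete, here is a minimal NumPy sketch of sparse proxy attention with a table-based relative bias. All sizes, variable names, the nearest-proxy association, and the distance-binning scheme are illustrative assumptions, not the paper's actual implementation; in SP$^2$T the bias table would be learned and the sampling is the proposed spatial-wise scheme.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: N points, P local proxies, C channels, K proxies per point.
N, P, C, K = 64, 8, 16, 2
points = rng.standard_normal((N, 3))
proxies = points[rng.choice(N, size=P, replace=False)]  # stand-in for proxy sampling
feat_pts = rng.standard_normal((N, C))
feat_prx = rng.standard_normal((P, C))

# Sparse association: each point links only to its K nearest proxies.
dist = np.linalg.norm(points[:, None, :] - proxies[None, :, :], axis=-1)  # (N, P)
nbr = np.argsort(dist, axis=1)[:, :K]                                     # (N, K)

# Table-based relative bias: quantize point-proxy distance into bins and
# look up a per-bin bias (random here; learned in practice).
num_bins = 8
bias_table = 0.1 * rng.standard_normal(num_bins)
bins = np.minimum((dist / dist.max() * num_bins).astype(int), num_bins - 1)

def sparse_proxy_attention(q_feats, kv_feats, nbr, bins, bias_table):
    """Each query (point) attends only to its associated proxies."""
    scale = 1.0 / np.sqrt(q_feats.shape[1])
    out = np.empty_like(q_feats)
    for i in range(q_feats.shape[0]):
        k = kv_feats[nbr[i]]                                        # (K, C)
        logits = k @ q_feats[i] * scale + bias_table[bins[i, nbr[i]]]
        w = np.exp(logits - logits.max())
        w /= w.sum()                                                # softmax over K proxies
        out[i] = w @ k                                              # weighted proxy features
    return out

updated = sparse_proxy_attention(feat_pts, feat_prx, nbr, bins, bias_table)
print(updated.shape)  # (64, 16)
```

Because each point interacts with a constant number of proxies rather than all N points, the cost grows linearly with the point count, which is the motivation for sparse proxy attention over dense global attention.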