SPOT-Occ: Sparse Prototype-guided Transformer for Camera-based 3D Occupancy Prediction

📅 2026-02-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high computational cost of sparse voxel decoding in camera-based real-time 3D occupancy prediction by proposing a prototype-guided sparse Transformer decoder. The method employs a two-stage guidance mechanism that integrates sparse prototype selection, dynamic feature focusing, and a ground-truth mask–based denoising training strategy. This enables queries to adaptively associate with salient voxel prototypes while maintaining stability across decoder layers. The proposed architecture substantially reduces computational complexity, achieving state-of-the-art 3D occupancy prediction accuracy without compromising inference speed.

📝 Abstract
Achieving highly accurate and real-time 3D occupancy prediction from cameras is a critical requirement for the safe and practical deployment of autonomous vehicles. Recent methods have shifted from dense to sparse 3D representations, which resolves the encoding bottleneck but creates a new challenge for the decoder: how to efficiently aggregate information from a sparse, non-uniformly distributed set of voxel features without resorting to computationally prohibitive dense attention. In this paper, we propose a novel Prototype-based Sparse Transformer Decoder that replaces this costly interaction with an efficient, two-stage process of guided feature selection and focused aggregation. Our core idea is to make the decoder's attention prototype-guided. We achieve this through a sparse prototype selection mechanism, where each query adaptively identifies a compact set of the most salient voxel features, termed prototypes, for focused feature aggregation. To ensure this dynamic selection is stable and effective, we introduce a complementary denoising paradigm. This approach leverages ground-truth masks to provide explicit guidance, guaranteeing a consistent query-prototype association across decoder layers. Our model, dubbed SPOT-Occ, outperforms previous methods by a significant margin in speed while also improving accuracy. Source code is released at https://github.com/chensuzeyu/SpotOcc.
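The two-stage process the abstract describes — each query first selecting a compact set of salient voxel prototypes, then aggregating only those — can be sketched as follows. This is a hypothetical illustration, not the released implementation: the scoring function, top-k size `k`, and function name `prototype_guided_attention` are all assumptions for clarity.

```python
import numpy as np

def prototype_guided_attention(queries, voxel_feats, k=4):
    """Sketch of a prototype-guided sparse decoder step (illustrative only).

    Stage 1: each query scores all sparse voxel features and keeps the
             top-k most salient ones as its "prototypes".
    Stage 2: the query aggregates only those k prototypes with softmax
             attention, avoiding dense attention over every voxel.

    queries:     (Q, C) query embeddings
    voxel_feats: (N, C) sparse voxel features, with N >> k
    returns:     (Q, C) updated query embeddings
    """
    # Stage 1: guided feature selection -- salience score per (query, voxel)
    scores = queries @ voxel_feats.T                           # (Q, N)
    topk_idx = np.argpartition(-scores, k - 1, axis=1)[:, :k]  # (Q, k)

    # Stage 2: focused aggregation over the selected prototypes only
    protos = voxel_feats[topk_idx]                             # (Q, k, C)
    sel = np.take_along_axis(scores, topk_idx, axis=1)         # (Q, k)
    w = np.exp(sel - sel.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                          # softmax weights
    return (w[:, :, None] * protos).sum(axis=1)                # (Q, C)
```

The point of the sketch is the complexity change: attention cost per query drops from O(N) over all voxels to O(k) over the selected prototypes, which is where the claimed speedup comes from.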
Problem

Research questions and friction points this paper is trying to address.

3D occupancy prediction
sparse representation
camera-based perception
autonomous driving
efficient attention
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse Prototype
Transformer Decoder
3D Occupancy Prediction
Guided Feature Selection
Denoising Paradigm
Suzeyu Chen
Artificial Intelligence Thrust, The Hong Kong University of Science and Technology (Guangzhou)
Leheng Li
HKUST(GZ)
Computer Vision
Ying-Cong Chen
Hong Kong University of Science and Technology (Guangzhou)
Computer Vision and Pattern Recognition