🤖 AI Summary
Existing 3D detection methods based on Denoising Diffusion Probabilistic Models (DDPMs) rely on multi-step iterative inference, which limits their efficiency, and they remain insufficiently robust. To address these limitations, this paper proposes RSDNet, a single-stage, fully sparse detection framework. Its key contributions are: (1) a Detachable Latent Framework (DLF) that learns the denoising process in latent feature spaces and whose denoising networks can be detached at inference, enabling single-step detection; (2) a semantic-geometric conditional guidance scheme that mitigates the absence of center-point features in sparse representations, combined with multi-type, multi-level noise samples that improve robustness to varied perturbations; and (3) lightweight multi-level denoising autoencoders (DAEs) that serve as the denoising networks. On public benchmarks, RSDNet is reported to achieve state-of-the-art detection performance while avoiding multi-step inference.
📝 Abstract
Denoising Diffusion Probabilistic Models (DDPMs) have shown success in robust 3D object detection tasks. Existing methods often rely on score matching from 3D boxes or on pre-trained diffusion priors. However, they typically require multiple iterative steps at inference, which limits efficiency. To address this, we propose a Robust single-stage fully Sparse 3D object Detection Network with a Detachable Latent Framework (DLF) of DDPMs, named RSDNet. Specifically, RSDNet learns the denoising process in latent feature spaces through lightweight denoising networks such as multi-level denoising autoencoders (DAEs). This enables RSDNet to model scene distributions under multi-level perturbations, yielding robust and reliable detection. Meanwhile, we reformulate the noising and denoising mechanisms of DDPMs so that DLF can construct multi-type, multi-level noise samples and targets, which improves RSDNet's robustness to multiple perturbations. Furthermore, a semantic-geometric conditional guidance is introduced to perceive object boundaries and shapes. This alleviates the missing-center-feature problem in sparse representations and allows RSDNet to operate as a fully sparse detection pipeline. Moreover, the detachable design of the DLF denoising networks enables RSDNet to perform single-step detection at inference, further improving detection efficiency. Extensive experiments on public benchmarks show that RSDNet outperforms existing methods and achieves state-of-the-art detection performance.
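To make the idea of multi-level noise samples and targets more concrete, the following is a minimal sketch of the standard DDPM forward (noising) process applied to a toy latent feature vector at several noise levels. It is not the reformulated mechanism the paper proposes, whose details are not given in the abstract. The linear beta schedule, the function names (`alpha_bar_schedule`, `add_noise`, `multi_level_samples`), the chosen levels, and the four-element feature vector are all assumptions made for illustration.

```python
import math
import random


def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(features, alpha_bar, rng):
    """Standard DDPM forward step: x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in features]
    noisy = [
        math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
        for x, e in zip(features, eps)
    ]
    return noisy, eps


def multi_level_samples(features, levels, alpha_bars, rng):
    """Build (noisy input, noise target) pairs at several noise levels."""
    return {t: add_noise(features, alpha_bars[t], rng) for t in levels}


rng = random.Random(0)
alpha_bars = alpha_bar_schedule(1000)
latent = [0.5, -1.2, 0.3, 2.0]  # hypothetical stand-in for one latent feature vector
samples = multi_level_samples(
    latent, levels=[10, 500, 999], alpha_bars=alpha_bars, rng=rng
)
```

Each entry of `samples` maps a noise level to a noisy feature vector and the Gaussian noise that produced it. Under the common noise-prediction objective, a denoising network would be trained to regress that noise from the noisy input. Because such a network is only needed to shape the features during training, a design of this kind can in principle drop it at inference, which is the property the abstract calls detachable.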