Robust Single-Stage Fully Sparse 3D Object Detection via Detachable Latent Diffusion

📅 2025-08-05
🤖 AI Summary
Existing 3D detection methods based on Denoising Diffusion Probabilistic Models (DDPMs) rely on multi-step iterative inference, which limits their efficiency. To address this, the paper proposes RSDNet, a robust single-stage fully sparse detection network. Its key contributions are: (1) a Detachable Latent Framework (DLF) that learns denoising in latent feature spaces with lightweight multi-level denoising autoencoders, and whose denoising network can be detached so that inference is a single step; (2) a reformulation of the DDPM noising and denoising mechanisms that produces multi-type, multi-level noise samples and targets, improving robustness to multiple perturbations; and (3) a semantic-geometric conditional guidance scheme that perceives object boundaries and shapes, alleviating the missing center features of sparse representations. On public benchmarks, RSDNet is reported to outperform existing methods and reach state-of-the-art detection performance.

📝 Abstract
Denoising Diffusion Probabilistic Models (DDPMs) have shown success in robust 3D object detection tasks. Existing methods often rely on score matching from 3D boxes or on pre-trained diffusion priors. However, they typically require multi-step iterations in inference, which limits efficiency. To address this, we propose a Robust single-stage fully Sparse 3D object Detection Network with a Detachable Latent Framework (DLF) of DDPMs, named RSDNet. Specifically, RSDNet learns the denoising process in latent feature spaces through lightweight denoising networks such as multi-level denoising autoencoders (DAEs). This enables RSDNet to effectively understand scene distributions under multi-level perturbations, achieving robust and reliable detection. Meanwhile, we reformulate the noising and denoising mechanisms of DDPMs, enabling DLF to construct multi-type and multi-level noise samples and targets, which enhances RSDNet's robustness to multiple perturbations. Furthermore, semantic-geometric conditional guidance is introduced to perceive object boundaries and shapes, alleviating the missing-center-feature problem in sparse representations and enabling RSDNet to operate in a fully sparse detection pipeline. Moreover, the detachable denoising network design of DLF enables RSDNet to perform single-step detection in inference, further enhancing detection efficiency. Extensive experiments on public benchmarks show that RSDNet can outperform existing methods, achieving state-of-the-art detection.
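
For orientation, the standard DDPM forward (noising) step can be sketched as below. This is the textbook closed form x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, not the reformulated mechanism the abstract refers to, whose details are not given here. The linear beta schedule, the function names, the latent vector, and the three noise levels are all assumptions chosen for illustration.

```python
import math
import random

def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(features, alpha_bar, rng):
    """Sample x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps."""
    signal = math.sqrt(alpha_bar)
    sigma = math.sqrt(1.0 - alpha_bar)
    noise = [rng.gauss(0.0, 1.0) for _ in features]
    noisy = [signal * x + sigma * e for x, e in zip(features, noise)]
    return noisy, noise

rng = random.Random(0)
alpha_bars = make_alpha_bars()
latent = [0.5, -1.2, 2.0, 0.0]   # hypothetical latent feature vector
levels = [10, 250, 999]          # hypothetical low, medium, high noise levels
samples = {t: add_noise(latent, alpha_bars[t], rng) for t in levels}
```

Each entry of `samples` pairs a noised copy of the latent vector with the noise that produced it, which is the kind of (sample, target) pair a denoising network is usually trained on.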
Problem

Research questions and friction points this paper is trying to address.

Improves efficiency in 3D object detection with single-step inference
Enhances robustness to multi-level perturbations via denoising networks
Addresses sparse representation issues with semantic-geometric guidance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Detachable Latent Framework for efficient denoising
Semantic-geometric guidance for object perception
Single-step detection for enhanced efficiency
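
The idea of a detachable denoising branch can be illustrated with a toy sketch: the denoiser takes part in training through an auxiliary reconstruction loss and is skipped at inference, so prediction is a single forward pass. This is not the paper's architecture. `ToyDetector`, its methods, the scalar features, and the 0.9 shrinkage factor are hypothetical stand-ins.

```python
import random

class ToyDetector:
    """Toy model: shared encoder, detection head, and a detachable denoiser."""

    def __init__(self):
        self.denoiser_calls = 0

    def encode(self, points):
        # Stand-in for a sparse backbone: one scalar feature per point.
        return [sum(p) / len(p) for p in points]

    def denoise(self, noisy_features):
        # Stand-in for a lightweight denoising autoencoder.
        self.denoiser_calls += 1
        return [0.9 * f for f in noisy_features]

    def detect(self, features):
        # Stand-in for a detection head: flag points with a positive feature.
        return [f > 0.0 for f in features]

    def training_step(self, points, sigma, rng):
        clean = self.encode(points)
        noisy = [f + sigma * rng.gauss(0.0, 1.0) for f in clean]
        restored = self.denoise(noisy)
        # Auxiliary reconstruction loss between restored and clean features.
        recon_loss = sum((r - c) ** 2 for r, c in zip(restored, clean)) / len(clean)
        return self.detect(restored), recon_loss

    def infer(self, points):
        # Denoiser detached: one pass through the encoder and detection head.
        return self.detect(self.encode(points))

model = ToyDetector()
points = [(1.0, 2.0, 0.5), (-3.0, 0.0, -1.0)]
_, loss = model.training_step(points, sigma=0.1, rng=random.Random(0))
predictions = model.infer(points)
```

In this sketch `infer` never touches `denoise`, so the inference cost is independent of how many noise levels were used during training.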