DuoSpaceNet: Leveraging Both Bird's-Eye-View and Perspective View Representations for 3D Object Detection

📅 2024-05-17
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
To address the feature disconnection and insufficient fusion between bird's-eye-view (BEV) and perspective-view (PV) representations in vision-only multi-view 3D object detection, this paper proposes an end-to-end BEV-PV full-space fusion framework, DuoSpaceNet. The authors design a cross-view Transformer decoder that jointly decodes BEV and PV features into unified detection queries, introduce a dual-space feature enhancement strategy that enriches each representation, and add a multi-frame temporal module for more robust analysis of dynamic scenes. Departing from prior designs that rely on partial fusion or separate detection heads, the approach unifies BEV and PV features within a single detection pipeline. On the nuScenes benchmark, DuoSpaceNet surpasses both BEV-based baselines (e.g., BEVFormer) and PV-based baselines (e.g., Sparse4D) in 3D object detection and BEV map segmentation.

📝 Abstract
Multi-view camera-only 3D object detection largely follows two primary paradigms: exploiting bird's-eye-view (BEV) representations or focusing on perspective-view (PV) features, each with distinct advantages. Although several recent approaches explore combining BEV and PV, many rely on partial fusion or maintain separate detection heads. In this paper, we propose DuoSpaceNet, a novel framework that fully unifies BEV and PV feature spaces within a single detection pipeline for comprehensive 3D perception. Our design includes a decoder to integrate BEV and PV features into unified detection queries, as well as a feature enhancement strategy that enriches different feature representations. In addition, DuoSpaceNet can be extended to handle multi-frame inputs, enabling more robust temporal analysis. Extensive experiments on the nuScenes dataset show that DuoSpaceNet surpasses both BEV-based baselines (e.g., BEVFormer) and PV-based baselines (e.g., Sparse4D) in 3D object detection and BEV map segmentation, verifying the effectiveness of our proposed design.
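The core idea of the decoder, as described above, is that a single set of detection queries attends to BEV and PV features jointly rather than through separate heads. A minimal sketch of one such cross-attention step is below; the function name `duo_space_decode` and the single-head, unprojected attention are illustrative simplifications, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def duo_space_decode(queries, bev_feats, pv_feats):
    """One simplified decoder step: each detection query cross-attends
    to the concatenation of BEV and PV features, so both spaces update
    the same unified queries (hypothetical sketch, not the paper's code).

    queries:   (Q, d) detection queries
    bev_feats: (N_bev, d) flattened BEV grid features
    pv_feats:  (N_pv, d) flattened perspective-view features
    """
    # Pool both feature spaces into one key/value set.
    kv = np.concatenate([bev_feats, pv_feats], axis=0)      # (N_bev + N_pv, d)
    # Scaled dot-product attention over the joint feature set.
    scores = queries @ kv.T / np.sqrt(queries.shape[-1])    # (Q, N_bev + N_pv)
    attn = softmax(scores, axis=-1)
    # Residual update: queries absorb information from both views.
    return queries + attn @ kv
```

The contrast with separate-head designs is that here the attention weights for BEV and PV cells compete within one softmax, so each query settles on whichever view (or mix of views) best explains an object.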
Problem

Research questions and friction points this paper is trying to address.

BEV and PV feature spaces remain disconnected, with prior work fusing them only partially or via separate detection heads
Feature integration is fragmented across pipelines rather than handled within a single detection pipeline
Temporal cues from multi-frame inputs are underexploited, limiting robustness in dynamic scenes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unifies BEV and PV feature spaces in a single detection pipeline
Decodes both feature spaces into shared, unified detection queries
Extends to multi-frame inputs for more robust temporal analysis