🤖 AI Summary
To address the feature disconnection and insufficient fusion between bird's-eye-view (BEV) and perspective-view (PV) representations in camera-only multi-view 3D object detection, this paper proposes DuoSpaceNet, an end-to-end BEV-PV full-space fusion framework. We design a cross-view Transformer decoder that jointly decodes BEV and PV features into unified detection queries within a shared latent space, introduce a dual-space collaborative feature enhancement mechanism, and integrate a multi-frame spatiotemporal modeling module to improve robustness in dynamic scenes. Unlike conventional designs that rely on separate detection heads or partial local fusion, our approach fully unifies BEV and PV features within a single inference pipeline. Evaluated on the nuScenes benchmark, our method surpasses leading approaches, including BEVFormer and Sparse4D, in both 3D object detection and BEV map segmentation.
📝 Abstract
Multi-view camera-only 3D object detection follows two primary paradigms: exploiting bird's-eye-view (BEV) representations or focusing on perspective-view (PV) features, each with distinct advantages. Although several recent approaches explore combining BEV and PV, many rely on partial fusion or maintain separate detection heads. In this paper, we propose DuoSpaceNet, a novel framework that fully unifies BEV and PV feature spaces within a single detection pipeline for comprehensive 3D perception. Our design includes a decoder that integrates BEV and PV features into unified detection queries, as well as a feature enhancement strategy that enriches the two feature representations. In addition, DuoSpaceNet can be extended to handle multi-frame inputs, enabling more robust temporal analysis. Extensive experiments on the nuScenes dataset show that DuoSpaceNet surpasses both BEV-based baselines (e.g., BEVFormer) and PV-based baselines (e.g., Sparse4D) in 3D object detection and BEV map segmentation, verifying the effectiveness of our proposed design.
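The abstract does not give architectural details, but the core idea of a single set of detection queries attending to both feature spaces can be illustrated with a minimal sketch. This is not the authors' implementation: the layer structure, module names, and tensor sizes below are all assumptions, showing only the general pattern of one decoder layer in which shared queries cross-attend to flattened BEV-grid features and then to flattened multi-camera PV features.

```python
import torch
import torch.nn as nn

class DuoSpaceDecoderLayer(nn.Module):
    """Hypothetical decoder layer: one query set, two feature spaces."""
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.bev_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.pv_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(4))

    def forward(self, queries, bev_feats, pv_feats):
        # queries:   (B, Q, C)      shared detection queries
        # bev_feats: (B, H*W, C)    flattened BEV grid features
        # pv_feats:  (B, N*h*w, C)  flattened multi-camera PV features
        q = self.norms[0](queries + self.self_attn(queries, queries, queries)[0])
        q = self.norms[1](q + self.bev_attn(q, bev_feats, bev_feats)[0])  # BEV space
        q = self.norms[2](q + self.pv_attn(q, pv_feats, pv_feats)[0])     # PV space
        return self.norms[3](q + self.ffn(q))

layer = DuoSpaceDecoderLayer()
out = layer(torch.randn(2, 100, 256),        # 100 queries
            torch.randn(2, 40 * 40, 256),    # 40x40 BEV grid
            torch.randn(2, 6 * 15 * 25, 256))  # 6 cameras, 15x25 each
print(out.shape)  # torch.Size([2, 100, 256])
```

The key property illustrated is that both cross-attention steps update the same query tensor, so downstream heads see a single, jointly decoded representation rather than separate per-space outputs. How DuoSpaceNet actually fuses the two spaces is specified in the paper, not here.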