Collaborative Perceiver: Elevating Vision-based 3D Object Detection via Local Density-Aware Spatial Occupancy

📅 2025-07-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-based BEV 3D detection methods neglect the intrinsic scene background (e.g., roads, sidewalks), which leaves environmental context under-modeled. To address this, the authors propose CoP, a multi-task learning framework that introduces **local density-aware spatial occupancy estimation** as an auxiliary task. CoP employs two key modules: **voxel-height-guided sampling (VHS)** to distill fine-grained local features according to object properties, and **global–local collaborative feature fusion (CFF)** to transfer complementary knowledge between the detection and occupancy tasks. By explicitly modeling scene geometry and density distribution, CoP strengthens BEV representation learning and improves detection accuracy. On the nuScenes test set, CoP achieves 49.5% mAP and 59.2% NDS, outperforming existing vision-based frameworks. These results suggest that joint geometric and density-aware modeling benefits BEV 3D perception.

📝 Abstract
Vision-based bird's-eye-view (BEV) 3D object detection has advanced significantly in autonomous driving by offering cost-effectiveness and rich contextual information. However, existing methods often construct BEV representations by collapsing extracted object features, neglecting intrinsic environmental contexts, such as roads and pavements. This hinders detectors from comprehensively perceiving the characteristics of the physical world. To alleviate this, we introduce a multi-task learning framework, Collaborative Perceiver (CoP), that leverages spatial occupancy as auxiliary information to mine consistent structural and conceptual similarities shared between 3D object detection and occupancy prediction tasks, bridging gaps in spatial representations and feature refinement. To this end, we first propose a pipeline to generate dense occupancy ground truths incorporating local density information (LDO) for reconstructing detailed environmental information. Next, we employ a voxel-height-guided sampling (VHS) strategy to distill fine-grained local features according to distinct object properties. Furthermore, we develop a global-local collaborative feature fusion (CFF) module that seamlessly integrates complementary knowledge between both tasks, thus composing more robust BEV representations. Extensive experiments on the nuScenes benchmark demonstrate that CoP outperforms existing vision-based frameworks, achieving 49.5% mAP and 59.2% NDS on the test set. Code and supplementary materials are available at https://github.com/jichengyuan/Collaborative-Perceiver.
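The abstract describes occupancy prediction as an auxiliary task trained alongside detection. One common way to combine two such objectives is a weighted sum of per-task losses; the sketch below illustrates that idea only. The function names, the binary cross-entropy occupancy term, and the `occ_weight` value are all assumptions for illustration and are not taken from the paper or its repository.

```python
import numpy as np

def occupancy_bce(pred_prob, target, eps=1e-7):
    """Mean binary cross-entropy between predicted occupancy probabilities and 0/1 targets."""
    p = np.clip(np.asarray(pred_prob, dtype=np.float64), eps, 1.0 - eps)
    t = np.asarray(target, dtype=np.float64)
    return float(np.mean(-(t * np.log(p) + (1.0 - t) * np.log(1.0 - p))))

def joint_loss(det_loss, pred_prob, occ_target, occ_weight=0.5):
    """Hypothetical multi-task objective: detection loss plus a weighted auxiliary occupancy loss."""
    return det_loss + occ_weight * occupancy_bce(pred_prob, occ_target)

# Toy usage: a 2x2x2 occupancy grid with an uninformative prediction of 0.5 everywhere.
pred = np.full((2, 2, 2), 0.5)
target = np.zeros((2, 2, 2))
target[0, 0, 0] = 1.0
total = joint_loss(det_loss=1.0, pred_prob=pred, occ_target=target, occ_weight=0.5)
```

In a weighted-sum setup like this, the weight controls how strongly the auxiliary task shapes the shared features; the paper's actual loss design may differ.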
Problem

Research questions and friction points this paper is trying to address.

Enhances 3D object detection by incorporating spatial occupancy and local density
Addresses neglect of environmental contexts in BEV representation construction
Improves feature refinement via multi-task learning and collaborative fusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates dense occupancy ground truths with LDO
Uses voxel-height-guided sampling for feature distillation
Integrates global-local features via CFF module
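To make the idea of occupancy ground truth with local density information more concrete, here is a minimal sketch that voxelizes a point cloud and records, per voxel, both a binary occupancy flag and a normalized point count. This is an illustrative assumption about what "local density" could mean, not the authors' LDO pipeline; the function name `density_occupancy`, the range format, and the max-normalization are all hypothetical.

```python
import numpy as np

def density_occupancy(points, pc_range, voxel_size):
    """Voxelize (N, 3) points into a binary occupancy grid and a normalized density grid.

    pc_range is (x_min, y_min, z_min, x_max, y_max, z_max); voxel_size is (dx, dy, dz).
    """
    points = np.asarray(points, dtype=np.float64).reshape(-1, 3)
    lo = np.asarray(pc_range[:3], dtype=np.float64)
    hi = np.asarray(pc_range[3:], dtype=np.float64)
    size = np.asarray(voxel_size, dtype=np.float64)
    dims = np.round((hi - lo) / size).astype(int)

    # Drop points outside the grid, then map the rest to integer voxel indices.
    in_range = np.all((points >= lo) & (points < hi), axis=1)
    idx = np.floor((points[in_range] - lo) / size).astype(int)
    idx = np.minimum(idx, dims - 1)  # guard against rounding at the upper edge

    # Count points per voxel; np.add.at accumulates correctly for repeated indices.
    counts = np.zeros(dims, dtype=np.int64)
    np.add.at(counts, (idx[:, 0], idx[:, 1], idx[:, 2]), 1)

    occupied = counts > 0
    peak = counts.max()
    density = counts / peak if peak > 0 else counts.astype(np.float64)
    return occupied, density

# Toy usage: three points inside a 2 m cube with 1 m voxels, one point outside it.
pts = [[0.1, 0.1, 0.1], [0.2, 0.2, 0.2], [1.5, 0.5, 0.5], [5.0, 5.0, 5.0]]
occ, dens = density_occupancy(pts, (0, 0, 0, 2, 2, 2), (1.0, 1.0, 1.0))
```

Keeping a density value next to the occupancy flag lets a training target distinguish sparsely hit voxels from densely hit ones, which a purely binary grid cannot express.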
Jicheng Yuan
Open Distributed Systems (ODS) Group at the Technische Universität Berlin and Fraunhofer FOKUS (Berlin, Germany)
Manh Nguyen Duc
Open Distributed Systems (ODS) Group at the Technische Universität Berlin and Fraunhofer FOKUS (Berlin, Germany)
Qian Liu
Open Distributed Systems (ODS) Group at the Technische Universität Berlin and Fraunhofer FOKUS (Berlin, Germany)
Manfred Hauswirth
Full professor TU Berlin and managing director Fraunhofer FOKUS
Internet of Everything, Distributed Inform. Syst., Linked Data Streams, (Semantic) Sensor Networks and Middleware, Semantic Web
Danh Le Phuoc
Group Leader and Principal Investigator at Technical University of Berlin
Semantic Stream Processing and Reasonin, Semantic Sensor Network/Middleware, Mobile Database, Semantic Mashup, Embedded System/I