Unleashing HyDRa: Hybrid Fusion, Depth Consistency and Radar for Unified 3D Perception

📅 2024-03-12
🏛️ arXiv.org
📈 Citations: 8
Influential: 1
🤖 AI Summary
To address the unreliability of pure-vision 3D perception in long-range, low-light, and adverse weather conditions—particularly for depth estimation—this paper proposes HyDRa, a novel camera-radar fusion framework. HyDRa introduces three key innovations: (1) a hybrid fusion mechanism that combines camera and radar features in two distinct representation spaces for precise cross-modal alignment; (2) a Height Association Transformer that exploits radar features in the perspective view for more robust depth estimation; and (3) a Radar-weighted Depth Consistency module that refines sparse bird's-eye-view (BEV) representations and regularizes the depth distribution. The framework integrates a dense BEV backbone with a semantically enhanced BEV-to-occupancy transformation. Evaluated on nuScenes, HyDRa achieves 64.2 NDS and 58.4 AMOTA—state-of-the-art performance for camera-radar fusion—while improving occupancy prediction on Occ3D by 3.7 mIoU over the best camera-only method.

📝 Abstract
Low-cost, vision-centric 3D perception systems for autonomous driving have made significant progress in recent years, narrowing the gap to expensive LiDAR-based methods. The primary challenge in becoming a fully reliable alternative lies in robust depth prediction capabilities, as camera-based systems struggle with long detection ranges and adverse lighting and weather conditions. In this work, we introduce HyDRa, a novel camera-radar fusion architecture for diverse 3D perception tasks. Building upon the principles of dense BEV (Bird's Eye View)-based architectures, HyDRa introduces a hybrid fusion approach to combine the strengths of complementary camera and radar features in two distinct representation spaces. Our Height Association Transformer module leverages radar features already in the perspective view to produce more robust and accurate depth predictions. In the BEV, we refine the initial sparse representation by a Radar-weighted Depth Consistency. HyDRa achieves a new state-of-the-art for camera-radar fusion of 64.2 NDS (+1.8) and 58.4 AMOTA (+1.5) on the public nuScenes dataset. Moreover, our new semantically rich and spatially accurate BEV features can be directly converted into a powerful occupancy representation, beating all previous camera-based methods on the Occ3D benchmark by an impressive 3.7 mIoU. Code and models are available at https://github.com/phi-wol/hydra.
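The Radar-weighted Depth Consistency described above can be pictured as using sparse but metrically accurate radar ranges to sharpen the camera branch's categorical depth distribution. The sketch below is an illustrative toy, not the authors' implementation: the bin layout, the Gaussian radar likelihood, the tempering exponent `radar_conf`, and all names (`DEPTH_BINS`, `reweight_depth`) are assumptions made for the example.

```python
import numpy as np

# Illustrative sketch of radar-guided depth reweighting, loosely inspired by
# the paper's Radar-weighted Depth Consistency idea. Bin layout, likelihood
# model, and parameter names are assumptions, not the authors' method.

DEPTH_BINS = np.linspace(1.0, 60.0, 60)  # candidate depths in metres (assumed)

def reweight_depth(cam_probs, radar_depth, sigma=2.0, radar_conf=0.8):
    """Refine a camera depth distribution with a sparse radar measurement.

    cam_probs   : (60,) categorical depth distribution from the camera branch
    radar_depth : scalar radar range for this pixel, or None if no return
    """
    if radar_depth is None:
        return cam_probs  # no radar evidence: keep the camera prior
    # Gaussian likelihood of each depth bin under the radar measurement,
    # tempered by a confidence exponent so radar never fully overrides camera
    radar_like = np.exp(-0.5 * ((DEPTH_BINS - radar_depth) / sigma) ** 2)
    fused = cam_probs * radar_like ** radar_conf + 1e-12
    return fused / fused.sum()
```

With a flat camera prior and a radar return at 20 m, the fused distribution peaks at the 20 m bin, mimicking how radar evidence can pull an ambiguous monocular depth estimate toward the measured range.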
Problem

Research questions and friction points this paper is trying to address.

Enhance depth prediction in camera-based 3D perception systems
Improve 3D perception under adverse lighting and weather conditions
Develop a hybrid camera-radar fusion for autonomous driving
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid fusion of camera and radar features
Height Association Transformer for depth prediction
Radar-weighted Depth Consistency in BEV representation
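The Height Association Transformer addresses a basic radar limitation: a return gives an image column and a range, but radar has little elevation resolution, so the row (height) is unknown. One way to picture the association step is as attention by the radar feature over the camera features of its image column. The sketch below is a toy single-head dot-product version under assumed shapes; the real module's architecture is not reproduced here.

```python
import numpy as np

# Toy sketch of column-wise radar-to-camera height association, loosely
# inspired by the paper's Height Association Transformer. The single-head
# dot-product attention and all shapes/names are illustrative assumptions.

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def associate_height(radar_feat, column_feats):
    """Softly assign a radar return to an image row in its column.

    radar_feat   : (C,) radar feature acting as the query
    column_feats : (H, C) camera features for each row of the image column
    """
    # scaled dot-product scores between the radar query and every row
    scores = column_feats @ radar_feat / np.sqrt(radar_feat.size)
    attn = softmax(scores)                  # soft assignment over rows
    fused = attn @ column_feats             # height-aware fused feature
    v_hat = attn @ np.arange(len(attn))     # expected row index (soft height)
    return fused, v_hat
```

If one row's camera feature matches the radar query, the attention collapses onto that row, recovering the missing height and producing a fused feature for the subsequent BEV lifting.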