Long-Tailed 3D Detection via Multi-Modal Fusion

📅 2023-12-18
📈 Citations: 4
Influential: 0

🤖 AI Summary
This work addresses the severe drop in 3D object detection accuracy on rare classes, such as emergency vehicles and strollers, under the long-tailed class distributions found in autonomous driving data. It formally studies Long-Tailed 3D Detection (LT3D), which evaluates all annotated classes, including those in the tail. To mitigate class imbalance, the authors propose hierarchical losses that promote feature sharing across classes, introduce diagnostic metrics that give partial credit for semantically reasonable mistakes, and present a Multi-Modal Late Fusion (MMLF) framework that fuses independently trained LiDAR and RGB detectors, so each detector can be trained on large-scale uni-modal data. They also examine whether to train 2D or 3D RGB detectors, whether to match detections in 3D or in the projected 2D image plane, and how to fuse matched detections. The MMLF approach improves rare-class performance from 12.8 to 20.0 mAP, a gain of 7.2 points over prior work.
📝 Abstract
Contemporary autonomous vehicle (AV) benchmarks have advanced techniques for training 3D detectors, particularly on large-scale multi-modal (LiDAR + RGB) data. Surprisingly, although semantic class labels naturally follow a long-tailed distribution, existing benchmarks only focus on a few common classes (e.g., pedestrian and car) and neglect many rare but crucial classes (e.g., emergency vehicle and stroller). However, AVs must reliably detect both common and rare classes for safe operation in the open world. We address this challenge by formally studying the problem of Long-Tailed 3D Detection (LT3D), which evaluates all annotated classes, including those in-the-tail. We address LT3D with hierarchical losses that promote feature sharing across classes, and introduce diagnostic metrics that award partial credit to "reasonable" mistakes with respect to the semantic hierarchy (e.g., mistaking a child for an adult). Further, we point out that rare-class accuracy is particularly improved via multi-modal late fusion (MMLF) of independently trained uni-modal LiDAR and RGB detectors. Importantly, such an MMLF framework allows us to leverage large-scale uni-modal datasets (with more examples for rare classes) to train better uni-modal detectors, unlike prevailing end-to-end trained multi-modal detectors that require paired multi-modal data. Finally, we examine three critical components of our simple MMLF approach from first principles and investigate whether to train 2D or 3D RGB detectors for fusion, whether to match RGB and LiDAR detections in 3D or the projected 2D image plane, and how to fuse matched detections. Our proposed MMLF approach significantly improves LT3D performance over prior work, particularly improving rare class performance from 12.8 to 20.0 mAP!
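To make the idea of partial credit with respect to a semantic hierarchy concrete, here is a minimal Python sketch. It is not the paper's metric: the `PARENT` hierarchy, the class names, and the scoring rule (depth of the lowest shared ancestor divided by the depth of the ground-truth class) are all assumptions chosen for illustration.

```python
# Hypothetical class hierarchy (child -> parent). The paper's actual
# hierarchy and class names may differ; this is for illustration only.
PARENT = {
    "adult": "pedestrian",
    "child": "pedestrian",
    "car": "vehicle",
    "emergency_vehicle": "vehicle",
    "pedestrian": "object",
    "vehicle": "object",
}


def path_to_root(label):
    """Return [label, parent, grandparent, ..., root]."""
    path = [label]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def partial_credit(pred, gt):
    """Illustrative credit in [0, 1] for predicting `pred` when the truth is `gt`.

    Assumed rule: depth of the lowest ancestor shared by both labels,
    divided by the depth of the ground-truth label.
    """
    gt_path = path_to_root(gt)
    gt_depth = len(gt_path) - 1  # the root has depth 0
    if gt_depth == 0:
        return 1.0 if pred == gt else 0.0
    pred_ancestors = set(path_to_root(pred))
    # Walk upward from the ground-truth label until we reach a shared node.
    for steps_up, node in enumerate(gt_path):
        if node in pred_ancestors:
            return (gt_depth - steps_up) / gt_depth
    return 0.0
```

Under these assumptions, `partial_credit("child", "adult")` returns 0.5 because both labels share the parent `pedestrian`, while `partial_credit("car", "adult")` returns 0.0 because they only meet at the root.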
Problem

Research questions and friction points this paper is trying to address.

Addressing long-tailed 3D detection for autonomous vehicles
Improving rare-class recognition via multi-modal fusion
Evaluating hierarchical losses and diagnostic metrics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical losses promote feature sharing across classes
Multi-modal late fusion of LiDAR and RGB detectors
2D RGB detectors and 2D matching improve rare class accuracy
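The late-fusion idea above, with 2D RGB detections matched against LiDAR detections in the image plane, can be sketched roughly as follows. This is a simplified illustration rather than the authors' implementation: it assumes LiDAR boxes have already been projected to 2D (`box2d`), uses greedy IoU matching with a hypothetical threshold, takes the class label from the matched RGB detection, and combines scores with a noisy-OR rule. The dictionary keys, the threshold, and the fusion rule are all placeholders.

```python
def iou_2d(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def late_fuse(lidar_dets, rgb_dets, iou_thresh=0.5):
    """Greedily match LiDAR detections to RGB detections in the image plane.

    lidar_dets: dicts with "box2d" (assumed pre-projected), "label", "score".
    rgb_dets:   dicts with "box", "label", "score".
    Unmatched LiDAR detections are passed through unchanged.
    """
    fused, used = [], set()
    # Process the most confident LiDAR detections first.
    for ld in sorted(lidar_dets, key=lambda d: d["score"], reverse=True):
        best_j, best_iou = None, iou_thresh
        for j, rd in enumerate(rgb_dets):
            if j in used:
                continue
            iou = iou_2d(ld["box2d"], rd["box"])
            if iou >= best_iou:
                best_j, best_iou = j, iou
        out = dict(ld)
        if best_j is not None:
            used.add(best_j)
            rd = rgb_dets[best_j]
            out["label"] = rd["label"]  # assumption: trust the RGB class label
            # Assumption: noisy-OR score fusion as a stand-in fusion rule.
            out["score"] = 1.0 - (1.0 - ld["score"]) * (1.0 - rd["score"])
        fused.append(out)
    return fused
```

Because the two detectors only interact at this matching step, each one can be trained separately on whatever uni-modal data is available, which is the property the summary highlights.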