FOMO-3D: Using Vision Foundation Models for Long-Tailed 3D Object Detection

📅 2026-03-09
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses the challenge of detecting rare yet safety-critical 3D objects—such as construction workers—under long-tailed distributions in autonomous driving. The authors propose a multimodal two-stage detection framework that, for the first time, leverages vision foundation models (OWLv2 and Metric3Dv2) to provide semantic and depth priors. A novel camera branch is designed to incorporate these priors, and an attention-based mechanism is employed to fuse LiDAR point cloud features with image features, thereby enhancing both proposal generation and refinement. Experiments on real-world driving data demonstrate that the proposed method significantly improves 3D detection performance on long-tailed categories, validating the effectiveness of integrating vision foundation model priors with multimodal fusion strategies.
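The attention-based fusion described above — LiDAR proposal features attending to image features carrying foundation-model priors — can be sketched as plain cross-attention. This is an illustrative toy, not the paper's implementation: the function name, feature dimensions, and random stand-in projection weights are all assumptions.

```python
import numpy as np

def cross_attention_fuse(lidar_feats, image_feats, d_k=64):
    """Hypothetical sketch of attention-based multimodal fusion:
    LiDAR proposal features (queries) attend to image features
    (keys/values); the result is added back residually.
    Projection weights are random placeholders, not learned."""
    rng = np.random.default_rng(0)
    d_l = lidar_feats.shape[-1]
    d_i = image_feats.shape[-1]
    W_q = rng.standard_normal((d_l, d_k)) / np.sqrt(d_l)
    W_k = rng.standard_normal((d_i, d_k)) / np.sqrt(d_i)
    W_v = rng.standard_normal((d_i, d_l)) / np.sqrt(d_i)

    Q = lidar_feats @ W_q                 # (N_proposals, d_k)
    K = image_feats @ W_k                 # (N_tokens, d_k)
    V = image_feats @ W_v                 # (N_tokens, d_l)

    scores = Q @ K.T / np.sqrt(d_k)       # (N_proposals, N_tokens)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)      # softmax over image tokens
    return lidar_feats + attn @ V         # residual fusion

# Toy inputs: 5 LiDAR proposals (128-d), 200 image tokens (256-d).
lidar = np.random.default_rng(1).standard_normal((5, 128))
image = np.random.default_rng(2).standard_normal((200, 256))
fused = cross_attention_fuse(lidar, image)
print(fused.shape)  # (5, 128)
```

The key property this sketch shows is that fusion preserves the proposal-feature shape, so the refined features can feed the same downstream detection heads.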

📝 Abstract
In order to navigate complex traffic environments, self-driving vehicles must recognize many semantic classes pertaining to vulnerable road users or traffic control devices. However, many safety-critical objects (e.g., construction worker) appear infrequently in nominal traffic conditions, leading to a severe shortage of training examples from driving data alone. Recent vision foundation models, which are trained on a large corpus of data, can serve as a good source of external prior knowledge to improve generalization. We propose FOMO-3D, the first multi-modal 3D detector to leverage vision foundation models for long-tailed 3D detection. Specifically, FOMO-3D exploits rich semantic and depth priors from OWLv2 and Metric3Dv2 within a two-stage detection paradigm that first generates proposals with a LiDAR-based branch and a novel camera-based branch, and refines them with attention especially to image features from OWL. Evaluations on real-world driving data show that using rich priors from vision foundation models with careful multi-modal fusion designs leads to large gains for long-tailed 3D detection. Project website is at https://waabi.ai/fomo3d/.
Problem

Research questions and friction points this paper is trying to address.

long-tailed 3D object detection
vision foundation models
autonomous driving
data scarcity
safety-critical objects
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision Foundation Models
Long-tailed 3D Object Detection
Multi-modal Fusion
Two-stage Detection
Semantic Priors