WildDet3D: Scaling Promptable 3D Detection in the Wild

📅 2026-04-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Monocular 3D object detection in open-world settings suffers from limited category generalization, reliance on a single prompt modality, and no mechanism for incorporating additional geometric cues; existing datasets compound the problem by covering only narrow categories in controlled environments. This work proposes WildDet3D, a unified geometry-aware architecture that natively accepts text, point, and box prompts and can incorporate auxiliary depth signals at inference time. It also introduces WildDet3D-Data, a large-scale open-world dataset built by generating candidate 3D boxes from existing 2D annotations and keeping only the human-verified ones, yielding over 1M images across 13.5K categories. With box prompts, WildDet3D achieves 24.8 AP3D on WildDet3D-Bench and 36.4 AP3D on Omni3D (22.6 and 34.2 with text prompts), and it attains 40.3/48.9 ODS in zero-shot transfer to Argoverse 2 and ScanNet. Incorporating depth cues at inference time yields a further gain of 20.7 AP on average across settings.

📝 Abstract

Understanding objects in 3D from a single image is a cornerstone of spatial intelligence. A key step toward this goal is monocular 3D object detection: recovering the extent, location, and orientation of objects from an input RGB image. To be practical in the open world, such a detector must generalize beyond closed-set categories, support diverse prompt modalities, and leverage geometric cues when available. Progress is hampered by two bottlenecks: existing methods are designed for a single prompt type and lack a mechanism to incorporate additional geometric cues, and current 3D datasets cover only narrow categories in controlled environments, limiting open-world transfer. In this work, we address both gaps. First, we introduce WildDet3D, a unified geometry-aware architecture that natively accepts text, point, and box prompts and can incorporate auxiliary depth signals at inference time. Second, we present WildDet3D-Data, the largest open 3D detection dataset to date, constructed by generating candidate 3D boxes from existing 2D annotations and retaining only human-verified ones, yielding over 1M images across 13.5K categories in diverse real-world scenes. WildDet3D establishes a new state of the art across multiple benchmarks and settings. In the open-world setting, it achieves 22.6/24.8 AP3D on our newly introduced WildDet3D-Bench with text and box prompts, respectively. On Omni3D, it reaches 34.2/36.4 AP3D with text and box prompts, respectively. In zero-shot evaluation, it achieves 40.3/48.9 ODS on Argoverse 2 and ScanNet, respectively. Notably, incorporating depth cues at inference time yields substantial additional gains (+20.7 AP on average across settings).
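
The abstract describes a detection as the extent, location, and orientation of an object. To make that output concrete, here is a minimal sketch of one common way to parameterize such a box and project it into the image with a pinhole camera. It is not the paper's representation: the `Box3D` class, its field names, the restriction of orientation to a single yaw angle, and the intrinsics `fx`, `fy`, `px`, `py` are all illustrative assumptions.

```python
import math
from dataclasses import dataclass


@dataclass
class Box3D:
    # Hypothetical container; names and conventions are assumptions, not from the paper.
    center: tuple  # location (x, y, z) in camera coordinates, metres; z is depth
    size: tuple    # extent (width, height, length) in metres
    yaw: float     # orientation, simplified to rotation about the vertical (y) axis, radians


def box_corners(box):
    """Return the 8 corners of the box in camera coordinates."""
    cx, cy, cz = box.center
    w, h, l = box.size
    c, s = math.cos(box.yaw), math.sin(box.yaw)
    corners = []
    for dx in (-l / 2, l / 2):
        for dy in (-h / 2, h / 2):
            for dz in (-w / 2, w / 2):
                # Rotate the local offset about the y axis, then translate to the center.
                x = c * dx + s * dz
                z = -s * dx + c * dz
                corners.append((cx + x, cy + dy, cz + z))
    return corners


def project(point, fx, fy, px, py):
    """Pinhole projection of a camera-space point (z > 0) to pixel coordinates."""
    x, y, z = point
    return (fx * x / z + px, fy * y / z + py)


# A box 10 m in front of the camera, viewed with assumed intrinsics.
box = Box3D(center=(0.0, 0.0, 10.0), size=(2.0, 1.5, 4.0), yaw=0.0)
corners = box_corners(box)
pixels = [project(p, fx=1000.0, fy=1000.0, px=640.0, py=360.0) for p in corners]
```

The sketch also shows why depth cues matter for a monocular detector: scaling `center` and `size` by the same factor leaves `pixels` unchanged, so the image alone does not fix the metric scale of the box.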

Problem

Research questions and friction points this paper is trying to address.

monocular 3D object detection

open-world generalization

prompt modalities

geometric cues

3D datasets

Innovation

Methods, ideas, or system contributions that make the work stand out.

promptable 3D detection

geometry-aware architecture

open-world generalization