🤖 AI Summary
This work addresses the challenge of robustly lifting 2D object detections to metric 3D bounding boxes in open-world scenarios where dense depth maps and 3D annotations are scarce. To this end, we propose Boxer, a pipeline built around BoxerNet, a Transformer-based network that takes 2D proposals from open-vocabulary detectors (e.g., OWLv2, DETIC), posed images, and optional depth cues (sparse or dense) and lifts them to 3D. A median-depth patch encoding lets the network handle sparse depth inputs, while aleatoric uncertainty modeling makes the regression more robust. Multi-view fusion and geometric filtering then produce globally consistent, de-duplicated 3D boxes. Because detection is delegated to existing 2D detectors, the model needs less costly 3D annotation; it is trained on over 1.2 million unique 3D bounding boxes. BoxerNet achieves an mAP of 0.532 in egocentric settings without dense depth, compared with 0.010 for CuTR, and 0.412 on CA-1M with dense depth, compared with 0.250 for CuTR.
📝 Abstract
Detecting and localizing objects in space is a fundamental computer vision problem. While much progress has been made in 2D object detection, 3D object localization is much less explored and far from solved, especially for open-world categories. To address this research challenge, we propose Boxer, an algorithm to estimate static 3D bounding boxes (3DBBs) from 2D open-vocabulary object detections, posed images, and optional depth, represented either as a sparse point cloud or as a dense depth map. At its core is BoxerNet, a transformer-based network which lifts 2D bounding box (2DBB) proposals into 3D, followed by multi-view fusion and geometric filtering to produce globally consistent, de-duplicated 3DBBs in metric world space. Boxer leverages the power of existing 2DBB detection algorithms (e.g., DETIC, OWLv2, SAM3) to localize objects in 2D. This allows the main BoxerNet model to focus on lifting to 3D rather than detecting, ultimately reducing the demand for costly annotated 3DBB training data. Extending the CuTR formulation, we incorporate aleatoric uncertainty for robust regression, a median depth patch encoding to support sparse depth inputs, and large-scale training with over 1.2 million unique 3DBBs. BoxerNet outperforms state-of-the-art baselines in open-world 3DBB lifting, including CuTR in egocentric settings without dense depth (0.532 vs. 0.010 mAP) and on CA-1M with dense depth available (0.412 vs. 0.250 mAP).
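The abstract names two ingredients without giving their form: a median depth patch encoding for sparse depth and an aleatoric uncertainty term for regression. The sketch below illustrates one plausible reading of each, and is not the authors' implementation. It assumes that sparse depth arrives as an image with zeros at missing pixels, that each patch is summarized by the median of its valid depths, and that the uncertainty term is the common Gaussian negative log-likelihood with a predicted log-variance. The function names, the patch size, and the zero-means-missing convention are all hypothetical.

```python
import warnings

import numpy as np


def median_depth_per_patch(sparse_depth, patch=16):
    """Median of the valid (> 0) depths in each patch x patch tile.

    Assumption: missing depth is encoded as 0. Tiles with no valid
    depth come back as NaN so a caller can mask them out.
    """
    h, w = sparse_depth.shape
    if h % patch or w % patch:
        raise ValueError("image size must be a multiple of the patch size")
    rows, cols = h // patch, w // patch
    # Rearrange to (rows, cols, patch * patch): one flat vector per tile.
    tiles = sparse_depth.reshape(rows, patch, cols, patch)
    tiles = tiles.transpose(0, 2, 1, 3).reshape(rows, cols, -1)
    tiles = np.where(tiles > 0, tiles, np.nan)
    with warnings.catch_warnings():
        # nanmedian warns on all-NaN tiles; NaN is the intended result.
        warnings.simplefilter("ignore", RuntimeWarning)
        return np.nanmedian(tiles, axis=-1)


def aleatoric_gaussian_nll(pred, target, log_var):
    """Gaussian NLL with predicted log-variance (constant term dropped).

    Assumption: this is one standard aleatoric formulation; the exact
    loss used by BoxerNet is not specified in the abstract.
    """
    residual_sq = (pred - target) ** 2
    return float(np.mean(0.5 * np.exp(-log_var) * residual_sq + 0.5 * log_var))
```

With this form of loss, a large predicted variance down-weights a large residual but pays a log-variance penalty, which is what makes the regression tolerant of noisy targets.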