🤖 AI Summary
Monocular 3D object detection (M3OD) suffers from prohibitively high annotation costs and inherent depth ambiguity in 2D imagery, resulting in scarce high-quality labeled data. To address this, we propose a self-supervised pseudo-labeling framework that operates solely on monocular video sequences, requiring no LiDAR, multi-view inputs, camera pose estimates, or shape priors. Our method leverages cross-frame object point tracking to construct temporally consistent pseudo-LiDAR point clouds for both static and dynamic objects, then applies a weakly supervised pseudo-label generation mechanism to enable end-to-end estimation of 3D attributes (3D location, dimensions, and orientation). This design significantly improves robustness under occlusion and in complex scenes. Evaluated on KITTI and nuScenes, our approach delivers reliable accuracy and strong scalability, establishing a cost-effective paradigm for practical monocular 3D detection.
📝 Abstract
Monocular 3D object detection (M3OD) has long been hindered by data scarcity stemming from high annotation costs and inherent 2D-to-3D ambiguity. Although various weakly supervised and pseudo-labeling methods have been proposed to address these issues, most are limited by domain-specific learning or rely solely on shape information from a single observation. In this paper, we propose a novel pseudo-labeling framework that uses only video data and is more robust to occlusion, without requiring a multi-view setup, additional sensors, camera poses, or domain-specific training. Specifically, we explore a technique for aggregating the pseudo-LiDAR point clouds of both static and dynamic objects across temporally adjacent frames using object point tracking, enabling 3D attribute extraction in scenarios where 3D data acquisition is infeasible. Extensive experiments demonstrate that our method ensures reliable accuracy and strong scalability, making it a practical and effective solution for M3OD.
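To make the core idea of cross-frame pseudo-LiDAR aggregation concrete, here is a minimal illustrative sketch, not the authors' implementation: tracked 2D object points are back-projected with per-frame monocular depth through a pinhole intrinsic matrix `K`, and the per-frame points are accumulated into one denser pseudo point cloud. All function names are hypothetical, and this toy version assumes a static object and camera; the paper's method instead uses point tracking to align dynamic objects across frames.

```python
import numpy as np

def backproject(uv, depth, K):
    """Back-project pixel coordinates (N, 2) with per-point depth (N,)
    into 3D camera coordinates (N, 3) via the pinhole model."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    z = depth
    x = (uv[:, 0] - cx) * z / fx
    y = (uv[:, 1] - cy) * z / fy
    return np.stack([x, y, z], axis=1)

def aggregate_pseudo_lidar(tracks, depths, K):
    """Accumulate back-projected points of one tracked object over frames.

    tracks: list of (N, 2) pixel positions of the same N tracked points
    depths: list of (N,) estimated monocular depths, one array per frame
    Returns a single (F*N, 3) pseudo-LiDAR cloud (static-scene toy case).
    """
    clouds = [backproject(uv, d, K) for uv, d in zip(tracks, depths)]
    return np.concatenate(clouds, axis=0)
```

In the full pipeline described in the abstract, such an aggregated cloud would then feed the weakly supervised pseudo-label generation step that extracts 3D location, dimensions, and orientation.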