🤖 AI Summary
Monocular 3D object detection (M3OD) suffers from prohibitively high annotation costs and inherent depth ambiguity in 2D imagery, resulting in scarce high-quality labeled data. To address this, we propose a self-supervised pseudo-labeling framework that operates solely on monocular video sequences, requiring no LiDAR, multi-view inputs, camera pose estimates, or shape priors. Our method leverages cross-frame object point tracking to construct temporally consistent pseudo-LiDAR point clouds for both static and dynamic objects, then applies a weakly supervised pseudo-label generation mechanism to enable end-to-end estimation of 3D attributes (3D location, dimensions, and orientation). This design significantly improves robustness under occlusion and in complex scenes. Evaluated on KITTI and nuScenes, our approach delivers reliable accuracy and strong scalability, establishing a cost-effective paradigm for practical monocular 3D detection.
📝 Abstract
Monocular 3D object detection (M3OD) has long been hindered by data scarcity stemming from high annotation costs and inherent 2D-to-3D ambiguity. Although various weakly supervised and pseudo-labeling methods have been proposed to address these issues, most are limited by domain-specific learning or rely solely on shape information from a single observation. In this paper, we propose a novel pseudo-labeling framework that uses only video data and is more robust to occlusion, without requiring a multi-view setup, additional sensors, camera poses, or domain-specific training. Specifically, we explore a technique for aggregating the pseudo-LiDAR point clouds of both static and dynamic objects across temporally adjacent frames using object point tracking, enabling 3D attribute extraction in scenarios where 3D data acquisition is infeasible. Extensive experiments demonstrate that our method ensures reliable accuracy and strong scalability, making it a practical and effective solution for M3OD.
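To make the core idea of cross-frame pseudo-LiDAR aggregation concrete, here is a minimal illustrative sketch, not the authors' implementation: tracked 2D object points are back-projected with per-frame monocular depth through a pinhole intrinsic matrix `K`, and the per-frame points are accumulated into one denser pseudo point cloud. All function names are hypothetical, and this toy version assumes a static object and camera; the paper's method instead uses point tracking to align dynamic objects across frames.

```python
import numpy as np

def backproject(uv, depth, K):
    """Back-project pixel coordinates (N, 2) with per-point depth (N,)
    into 3D camera coordinates (N, 3) via the pinhole model."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    z = depth
    x = (uv[:, 0] - cx) * z / fx
    y = (uv[:, 1] - cy) * z / fy
    return np.stack([x, y, z], axis=1)

def aggregate_pseudo_lidar(tracks, depths, K):
    """Accumulate back-projected points of one tracked object over frames.

    tracks: list of (N, 2) pixel positions of the same N tracked points
    depths: list of (N,) estimated monocular depths, one array per frame
    Returns a single (F*N, 3) pseudo-LiDAR cloud (static-scene toy case).
    """
    clouds = [backproject(uv, d, K) for uv, d in zip(tracks, depths)]
    return np.concatenate(clouds, axis=0)
```

In the full pipeline described in the abstract, such an aggregated cloud would then feed the weakly supervised pseudo-label generation step that extracts 3D location, dimensions, and orientation.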