PLOT: Pseudo-Labeling via Video Object Tracking for Scalable Monocular 3D Object Detection

📅 2025-07-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Monocular 3D object detection (M3OD) suffers from prohibitively high annotation costs and inherent depth ambiguity in 2D imagery, resulting in scarce high-quality labeled data. To address this, we propose a self-supervised pseudo-labeling framework that operates solely on monocular video sequences, requiring no LiDAR, multi-view inputs, camera pose estimates, or shape priors. Our method leverages cross-frame object tracking to construct temporally consistent pseudo-LiDAR point clouds for both static and dynamic objects, then integrates a weakly supervised pseudo-label generation mechanism to enable end-to-end estimation of 3D attributes (i.e., 3D location, dimensions, and orientation). This design improves robustness under occlusion and in complex scenes. Evaluated on KITTI and nuScenes, our approach achieves state-of-the-art performance while demonstrating strong scalability, offering a cost-effective route to practical monocular 3D detection.

📝 Abstract
Monocular 3D object detection (M3OD) has long faced challenges due to data scarcity caused by high annotation costs and inherent 2D-to-3D ambiguity. Although various weakly supervised methods and pseudo-labeling methods have been proposed to address these issues, they are mostly limited by domain-specific learning or rely solely on shape information from a single observation. In this paper, we propose a novel pseudo-labeling framework that uses only video data and is more robust to occlusion, without requiring a multi-view setup, additional sensors, camera poses, or domain-specific training. Specifically, we explore a technique for aggregating the pseudo-LiDARs of both static and dynamic objects across temporally adjacent frames using object point tracking, enabling 3D attribute extraction in scenarios where 3D data acquisition is infeasible. Extensive experiments demonstrate that our method ensures reliable accuracy and strong scalability, making it a practical and effective solution for M3OD.
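
The aggregation idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes a metric depth map and pinhole intrinsics (`fx`, `fy`, `cx`, `cy`) are available, that tracked points give matched 3D positions of the same object in two frames, and that the relative motion is well approximated by a rotation about the camera's vertical axis plus a translation. All function names are hypothetical.

```python
import math

def backproject(depth, fx, fy, cx, cy):
    """Lift a depth map (list of rows, metric depth) to camera-frame 3D points."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z > 0:  # skip pixels without a valid depth
                points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points

def fit_yaw_translation(src, dst):
    """Least-squares rotation in the x-z plane plus translation mapping src to dst.

    src and dst are matched 3D points of one object (for example, lifted
    point tracks). Assumption: motion is planar, so only yaw is estimated.
    """
    n = len(src)
    src_c = [sum(p[i] for p in src) / n for i in range(3)]
    dst_c = [sum(p[i] for p in dst) / n for i in range(3)]
    s_dot = s_cross = 0.0
    for p, q in zip(src, dst):
        ax, az = p[0] - src_c[0], p[2] - src_c[2]
        bx, bz = q[0] - dst_c[0], q[2] - dst_c[2]
        s_dot += ax * bx + az * bz
        s_cross += ax * bz - az * bx
    return math.atan2(s_cross, s_dot), src_c, dst_c

def transform(points, theta, src_c, dst_c):
    """Apply the fitted motion: rotate about src_c, then move to dst_c."""
    c, s = math.cos(theta), math.sin(theta)
    out = []
    for x, y, z in points:
        ax, az = x - src_c[0], z - src_c[2]
        out.append((c * ax - s * az + dst_c[0],
                    y - src_c[1] + dst_c[1],
                    s * ax + c * az + dst_c[2]))
    return out

def aggregate(ref_points, other_points, other_tracks, ref_tracks):
    """Merge an object's points from another frame into the reference frame."""
    theta, src_c, dst_c = fit_yaw_translation(other_tracks, ref_tracks)
    return list(ref_points) + transform(other_points, theta, src_c, dst_c)
```

Because the alignment is fitted per object from its own tracks, the same routine covers static and moving objects without a camera pose. At least two distinct tracked points are needed for the yaw to be defined, and a full 3D rotation fit would be required where the planar-motion assumption does not hold.
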
Problem

Research questions and friction points this paper is trying to address.

Addresses data scarcity in monocular 3D object detection
Enhances robustness to occlusion without multi-view setups
Enables 3D attribute extraction from video data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses video data for pseudo-labeling
Aggregates pseudo-LiDARs via object tracking
Extracts 3D attributes without multi-view setup
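
As a rough illustration of what extracting 3D attributes from an aggregated point cloud can look like, the sketch below fits an oriented box using the principal direction of the points in the ground (x-z) plane. This is a generic heuristic chosen for illustration, not the procedure used in the paper; the function name and the axis convention (x right, y down, z forward) are assumptions.

```python
import math

def fit_oriented_box(points):
    """Estimate center, (length, width, height) and yaw from 3D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    mz = sum(p[2] for p in points) / n
    # 2x2 covariance of the points in the x-z plane
    cxx = sum((p[0] - mx) ** 2 for p in points) / n
    czz = sum((p[2] - mz) ** 2 for p in points) / n
    cxz = sum((p[0] - mx) * (p[2] - mz) for p in points) / n
    # Angle of the direction of largest variance (closed form for 2x2 PCA)
    yaw = 0.5 * math.atan2(2.0 * cxz, cxx - czz)
    c, s = math.cos(yaw), math.sin(yaw)
    # Coordinates of each point along and across the principal direction
    along = [c * (p[0] - mx) + s * (p[2] - mz) for p in points]
    across = [-s * (p[0] - mx) + c * (p[2] - mz) for p in points]
    ys = [p[1] for p in points]
    mid_a = (max(along) + min(along)) / 2.0
    mid_b = (max(across) + min(across)) / 2.0
    center = (mx + c * mid_a - s * mid_b,
              (max(ys) + min(ys)) / 2.0,
              mz + s * mid_a + c * mid_b)
    dims = (max(along) - min(along), max(across) - min(across), max(ys) - min(ys))
    return {"center": center, "dims": dims, "yaw": yaw}
```

The yaw recovered this way is ambiguous by 180 degrees, and min/max extents are sensitive to outliers, so a practical pipeline would typically filter points or use robust percentiles first.
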
Authors

Seokyeong Lee
Korea Institute of Science and Technology (KIST)
Sithu Aung
Korea Institute of Science and Technology (KIST)
Junyong Choi
Korea Institute of Science and Technology (KIST) and Korea University
Seungryong Kim
Associate Professor, KAIST
Computer Vision, Machine Learning
Ig-Jae Kim
KIST
Deep Learning, Computer Graphics, Computer Vision, Image Processing
Junghyun Cho
Korea Institute of Science and Technology (KIST), AI-Robotics, KIST School, University of Science and Technology (UST), Yonsei-KIST Convergence Research Institute, Yonsei University