🤖 AI Summary
This work proposes LH3D, a learnability-oriented active learning framework for monocular 3D object detection from roadside cameras, addressing the challenges of annotation difficulty and low model learnability. LH3D introduces learnability as the core sampling criterion, explicitly suppressing inherently ambiguous and hard-to-annotate samples. By integrating submodular optimization, the framework enhances both annotation efficiency and model performance while ensuring informative coverage. Evaluated on the DAIR-V2X-I dataset, LH3D achieves 86.06% (vehicles), 67.32% (pedestrians), and 78.67% (cyclists) of the fully supervised performance using only 25% of the annotation budget, significantly outperforming conventional uncertainty-based active learning approaches.
📝 Abstract
Roadside perception datasets are typically constructed via cooperative labeling of synchronized vehicle and roadside frame pairs. However, real deployments often require annotating roadside-only data due to hardware and privacy constraints. Even human experts struggle to produce accurate labels without vehicle-side data (image, LiDAR), which not only increases annotation difficulty and cost but also exposes a fundamental learnability problem: many roadside-only scenes contain distant, blurred, or occluded objects whose 3D properties are ambiguous from a single view and can only be reliably annotated by cross-checking paired vehicle–roadside frames. We refer to such cases as inherently ambiguous samples. To reduce annotation effort wasted on inherently ambiguous samples while still obtaining high-performing models, we turn to active learning. This work focuses on active learning for roadside monocular 3D object detection and proposes a learnability-driven framework that selects scenes which are both informative and reliably labelable, suppressing inherently ambiguous samples while ensuring coverage. Experiments demonstrate that our method, LH3D, achieves 86.06%, 67.32%, and 78.67% of the fully supervised performance for vehicles, pedestrians, and cyclists, respectively, using only 25% of the annotation budget on DAIR-V2X-I, significantly outperforming uncertainty-based baselines. This confirms that for roadside 3D perception, learnability, not uncertainty, is what matters.
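The selection principle described above, submodular coverage weighted by a learnability score that down-ranks inherently ambiguous scenes, can be sketched roughly as follows. This is an illustrative sketch, not the paper's actual formulation: the scene embeddings, the facility-location-style coverage objective, and the `learnability` scores are all hypothetical stand-ins.

```python
import numpy as np

def select_scenes(features, learnability, budget, lam=0.5):
    """Greedy selection under a monotone submodular objective that trades off
    facility-location coverage against per-scene learnability.

    features:     (N, D) scene embeddings (hypothetical representation)
    learnability: (N,) scores in [0, 1]; low values flag scenes that are
                  inherently ambiguous and hard to annotate
    budget:       number of scenes to send for labeling
    lam:          trade-off between coverage gain and learnability
    """
    # Cosine similarity between all scene pairs.
    norm = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = norm @ norm.T

    selected = []
    covered = np.zeros(len(features))  # best similarity to any selected scene
    for _ in range(budget):
        # Marginal facility-location gain of adding each candidate scene.
        gain = np.maximum(sim - covered, 0.0).sum(axis=1)
        score = lam * gain / len(features) + (1.0 - lam) * learnability
        score[selected] = -np.inf  # never pick a scene twice
        best = int(np.argmax(score))
        selected.append(best)
        covered = np.maximum(covered, sim[best])
    return selected
```

With `lam` near 1 the picker behaves like a pure coverage-based selector; lowering it increasingly suppresses low-learnability scenes even when they would add coverage.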