🤖 AI Summary
Monocular 3D object detection suffers from unstable size estimation in unified multi-class settings, owing to scale-depth ambiguity, occlusion, and truncation. This work proposes MonoPRIO, a unified multi-class detector that adds adaptive prior routing to the size pathway. It builds class-aware size prototypes offline, routes each decoder query to a soft mixture of those priors, and applies uncertainty-aware log-space conditioning together with Cluster-Aligned Prior (CAP) regularization on matched positives during training, which mitigates size ambiguity when image evidence is insufficient. On the official KITTI test server, MonoPRIO achieves the strongest fully reported unified multi-class result among methods that report complete Car, Pedestrian, and Cyclist metrics. In the car-only setting, it attains the highest 3D AP at the Easy, Moderate, and Hard difficulty levels among compared methods that use no extra data, while using substantially less compute than MonoCLUE.
📝 Abstract
Monocular 3D object detection remains challenging because metric size and depth are underdetermined by single-view evidence, particularly under occlusion, truncation, and projection-induced scale-depth ambiguity. Although recent methods improve depth and geometric reasoning, metric size remains unstable in unified multi-class settings, where class variability and partial visibility broaden plausible size modes. We propose MonoPRIO, a unified monocular 3D detector that targets this bottleneck through adaptive prior conditioning in the size pathway. MonoPRIO constructs class-aware size prototypes offline, routes each decoder query to a soft mixture prior, applies uncertainty-aware log-space conditioning, and uses Cluster-Aligned Prior (CAP) regularisation on matched positives during training. On the official KITTI test server, MonoPRIO achieves the strongest fully reported unified multi-class result among methods reporting complete Car, Pedestrian, and Cyclist metrics. In the car-only setting, it also achieves the strongest 3D bounding-box AP across Easy/Moderate/Hard categories among compared methods without extra data, while using substantially less compute than MonoCLUE. Ablations and diagnostics show complementary gains from routed injection and CAP, with the largest benefits in ambiguity-prone, partially occluded, and low-data regimes. These findings indicate that adaptive priors are most effective when image evidence underdetermines metric size, while atypical geometry or extreme visibility loss can still cause mismatch between routed priors and true instance geometry. Code, trained models, result logs, and reproducibility material are available at https://github.com/bigggs/MonoPRIO.
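The routing and conditioning steps named in the abstract can be illustrated with a minimal NumPy sketch. It is not the MonoPRIO implementation: the prototype values, the function names `route_prior` and `condition_size`, and the sigmoid gate on the predicted log-variance are all assumptions made for illustration. The abstract does not give the form of the CAP regulariser, so it is left out.

```python
import numpy as np

# Illustrative (height, width, length) prototypes in metres. These are round
# numbers chosen for this sketch, not values taken from the paper.
PROTOTYPES = np.array([
    [1.50, 1.60, 3.90],  # car-like
    [1.75, 0.65, 0.85],  # pedestrian-like
    [1.75, 0.60, 1.75],  # cyclist-like
])
LOG_PROTOTYPES = np.log(PROTOTYPES)


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def route_prior(routing_logits, log_prototypes=LOG_PROTOTYPES):
    """Route each query to a soft mixture of prototypes, mixed in log space.

    routing_logits: (Q, K) scores of Q queries over K prototypes.
    Returns the (Q, K) mixture weights and the (Q, 3) log-size prior.
    """
    weights = softmax(routing_logits)
    return weights, weights @ log_prototypes


def condition_size(pred_log_size, log_variance, prior_log_size):
    """Blend predicted and prior log-sizes with an uncertainty gate.

    Assumed gate: sigmoid of the predicted log-variance, so an uncertain
    prediction leans on the prior and a confident one is left almost as is.
    """
    gate = 1.0 / (1.0 + np.exp(-log_variance))
    fused = (1.0 - gate) * pred_log_size + gate * prior_log_size
    return np.exp(fused)  # back to metric size


# One query routed mostly to the car-like prototype.
routing_logits = np.array([[4.0, 0.0, 0.0]])
pred_log_size = np.log(np.array([[1.2, 1.4, 3.0]]))
weights, prior = route_prior(routing_logits)

confident = condition_size(pred_log_size, np.full((1, 3), -10.0), prior)
uncertain = condition_size(pred_log_size, np.full((1, 3), 10.0), prior)
```

In this sketch the fusion happens in log space, so the blend is a weighted geometric mean of the predicted and prior sizes and the result is always positive. With a very low log-variance the output stays close to the network's own prediction; with a very high one it collapses to the routed prior, which mirrors the abstract's claim that priors matter most when image evidence underdetermines metric size.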