🤖 AI Summary
This work addresses the performance degradation of long-tailed categories (such as children and strollers) in vision-only 3D object detection, which stems from data scarcity, inter-class ambiguity, and intra-class diversity. To mitigate these challenges, the authors propose a semantic-guided expert distillation framework that integrates CLIP’s language priors to enhance discriminative feature learning. Specifically, they design a language-guided mixture-of-experts module that enables semantic-aware routing of 3D queries and introduce semantic projection distillation to align 3D queries with 2D semantic features. Experimental results demonstrate that the proposed method significantly improves detection accuracy for rare classes while preserving overall performance, and further enhances model robustness under appearance variations and edge-case scenarios.
📝 Abstract
Camera-only 3D object detection has emerged as a cost-effective and scalable alternative to LiDAR for autonomous driving, yet existing methods mostly optimize for overall performance while overlooking the severe long-tail imbalance inherent in real-world datasets. In practice, many rare but safety-critical categories such as children, strollers, or emergency vehicles are heavily underrepresented, leading to biased learning and degraded performance on those categories. This challenge is further exacerbated by pronounced inter-class ambiguity (e.g., visually similar subclasses) and substantial intra-class diversity (e.g., objects varying widely in appearance, scale, pose, or context), which together hinder reliable long-tail recognition. In this work, we introduce SemLT3D, a Semantic-Guided Expert Distillation framework designed to enrich the representation space for underrepresented classes through semantic priors. SemLT3D consists of: (1) a language-guided mixture-of-experts module that routes 3D queries to specialized experts according to their semantic affinity, enabling the model to better disentangle confusing classes and specialize in tail distributions; and (2) a semantic projection distillation pipeline that aligns 3D queries with CLIP-informed 2D semantics, producing more coherent and discriminative features across diverse visual manifestations. Although motivated by long-tail imbalance, the semantically structured learning in SemLT3D also improves robustness under broader appearance variations and challenging corner cases, offering a principled step toward more reliable camera-only 3D perception.
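The abstract does not spell out how routing or distillation is computed, so the following is only an illustrative sketch of the two ideas, not the authors' implementation. It assumes that each expert is associated with a text-embedding anchor, that routing uses a temperature-scaled softmax over cosine similarities with top-k selection, and that alignment is measured by a cosine distance between a projected 3D query and a 2D semantic feature. All names (`route_query`, `alignment_loss`, `expert_text_anchors`) and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def softmax(scores, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_query(query, expert_text_anchors, top_k=1, temperature=0.1):
    """Pick the top_k experts whose (assumed) text anchors best match the query.

    Returns a list of (expert_index, weight) pairs; weights are renormalized
    over the selected experts so they sum to 1.
    """
    sims = [cosine(query, anchor) for anchor in expert_text_anchors]
    gates = softmax(sims, temperature)
    chosen = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)[:top_k]
    norm = sum(gates[i] for i in chosen)
    return [(i, gates[i] / norm) for i in chosen]


def alignment_loss(projected_query, semantic_feature):
    """Assumed distillation objective: 1 - cosine similarity.

    Approaches zero when the projected 3D query points in the same
    direction as the 2D semantic feature.
    """
    return 1.0 - cosine(projected_query, semantic_feature)


# Toy stand-ins for text embeddings of two expert groups.
anchors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
query = [0.9, 0.1, 0.0]

routes = route_query(query, anchors, top_k=1)
soft_routes = route_query(query, anchors, top_k=2)
loss_same = alignment_loss(query, query)
loss_far = alignment_loss(query, anchors[1])
```

In this toy setup the query lies closest to the first anchor, so it is routed to expert 0, and the alignment loss is smaller for a matching feature than for a mismatched one. A real system would use learned projections and embeddings from a pretrained vision-language model in place of the hand-written vectors.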