SemLT3D: Semantic-Guided Expert Distillation for Camera-only Long-Tailed 3D Object Detection

📅 2026-04-20
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the degraded performance of long-tailed categories, such as children and strollers, in camera-only 3D object detection, which stems from data scarcity, inter-class ambiguity, and intra-class diversity. To mitigate these challenges, the authors propose SemLT3D, a semantic-guided expert distillation framework that uses CLIP's language priors to learn more discriminative features. It has two components: a language-guided mixture-of-experts module that routes 3D queries to specialized experts according to their semantic affinity, and a semantic projection distillation step that aligns 3D queries with CLIP-informed 2D semantic features. The reported experiments indicate that the method improves detection accuracy for rare classes while preserving overall performance, and that it is more robust to appearance variations and corner cases.
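The semantic-aware routing described above can be illustrated with a small sketch. This is not the authors' implementation: it assumes each expert is tied to a semantic anchor vector (a hypothetical stand-in for a CLIP text embedding), that 3D queries live in the same space, and that gating is a temperature-scaled softmax over cosine similarities followed by top-k selection. The names `route_query`, `expert_anchors`, `top_k`, and `temperature` are illustrative only.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-12)  # small epsilon guards against zero vectors


def softmax(scores, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_query(query, expert_anchors, top_k=1, temperature=0.1):
    """Return [(expert_index, weight)] for the top_k experts by semantic affinity.

    Assumption: affinity is the cosine similarity between the query and each
    expert's anchor; the weights of the selected experts are renormalised.
    """
    sims = [cosine(query, anchor) for anchor in expert_anchors]
    gates = softmax(sims, temperature)
    ranked = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)[:top_k]
    norm = sum(gates[i] for i in ranked)
    return [(i, gates[i] / norm) for i in ranked]


# Toy anchors: hypothetical stand-ins for text embeddings of class groups.
expert_anchors = [
    [1.0, 0.0, 0.0],  # e.g. "vehicles"
    [0.0, 1.0, 0.0],  # e.g. "pedestrians"
    [0.0, 0.0, 1.0],  # e.g. "small wheeled objects"
]
query = [0.1, 0.9, 0.2]  # a query that mostly resembles the second anchor
routing = route_query(query, expert_anchors, top_k=2)
```

In this toy setup the query is sent mainly to the second expert, with a small share going to the third. How the paper actually defines anchors, gating, and expert capacity is not stated in this summary.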

📝 Abstract
Camera-only 3D object detection has emerged as a cost-effective and scalable alternative to LiDAR for autonomous driving, yet existing methods primarily prioritize overall performance while overlooking the severe long-tail imbalance inherent in real-world datasets. In practice, many rare but safety-critical categories such as children, strollers, or emergency vehicles are heavily underrepresented, leading to biased learning and degraded performance. This challenge is further exacerbated by pronounced inter-class ambiguity (e.g., visually similar subclasses) and substantial intra-class diversity (e.g., objects varying widely in appearance, scale, pose, or context), which together hinder reliable long-tail recognition. In this work, we introduce SemLT3D, a Semantic-Guided Expert Distillation framework designed to enrich the representation space for underrepresented classes through semantic priors. SemLT3D consists of: (1) a language-guided mixture-of-experts module that routes 3D queries to specialized experts according to their semantic affinity, enabling the model to better disentangle confusing classes and specialize on tail distributions; and (2) a semantic projection distillation pipeline that aligns 3D queries with CLIP-informed 2D semantics, producing more coherent and discriminative features across diverse visual manifestations. Although motivated by long-tail imbalance, the semantically structured learning in SemLT3D also improves robustness under broader appearance variations and challenging corner cases, offering a principled step toward more reliable camera-only 3D perception.
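As a rough illustration of component (2), the sketch below aligns projected 3D queries with matched 2D semantic features. It is an assumption-laden toy, not the paper's pipeline: it assumes a single linear projection into the semantic space and a cosine-distance objective, and the names `project`, `distillation_loss`, `weight`, and `teacher_feats` are hypothetical. The abstract does not specify the projection or the loss.

```python
import math


def project(query, weight):
    """Linear projection; `weight` holds one row per output dimension."""
    return [sum(w * q for w, q in zip(row, query)) for row in weight]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-12)  # small epsilon guards against zero vectors


def distillation_loss(queries, teacher_feats, weight):
    """Mean (1 - cosine) between projected queries and their matched teacher features.

    Assumption: each 3D query is already paired with one 2D semantic feature.
    """
    losses = [
        1.0 - cosine(project(q, weight), t)
        for q, t in zip(queries, teacher_feats)
    ]
    return sum(losses) / len(losses)


# Toy projection from a 3-D query space to a 2-D semantic space.
weight = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
]
teacher = [1.0, 0.0]  # hypothetical 2D semantic feature

# A query whose projection points the same way as the teacher feature.
loss_aligned = distillation_loss([[2.0, 0.0, 5.0]], [teacher], weight)

# Adding a query whose projection is orthogonal to the teacher raises the mean.
loss_mixed = distillation_loss(
    [[2.0, 0.0, 5.0], [0.0, 3.0, 1.0]], [teacher, teacher], weight
)
```

The loss is near zero when projected queries point in the same direction as their teacher features and grows as they diverge, which is the sense in which such an objective would pull 3D queries toward 2D semantics.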

Problem

Research questions and friction points this paper is trying to address.

long-tailed 3D object detection
camera-only perception
class imbalance
inter-class ambiguity
intra-class diversity

Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic-Guided Expert Distillation
Long-Tailed 3D Object Detection
Camera-only Perception
Mixture-of-Experts
CLIP-based Semantic Alignment