🤖 AI Summary
This work addresses object detection in medical imaging under weak semi-supervision, where most instances carry only a point-level annotation and only a small subset has box labels. Inferring accurate bounding boxes from points is difficult in the presence of anatomical overlap and scale variation. To this end, the authors propose DExTeR, a Point-DETR-based Transformer architecture for point-to-box regression. DExTeR enhances category-specific feature extraction through class-guided deformable attention, improves instance discrimination via a CLICK-MoE (CLass, Instance, and Common Knowledge Mixture of Experts) module, and incorporates a multi-point consistency training strategy to improve robustness to variations in annotation placement. Evaluated on three medical imaging datasets (endoscopy, chest X-ray, and endoscopic ultrasound), DExTeR achieves state-of-the-art detection performance while reducing annotation costs.
📝 Abstract
Detecting anatomical landmarks in medical imaging is essential for diagnosis and intervention guidance. However, object detection models rely on costly bounding box annotations, limiting scalability. Weakly Semi-Supervised Object Detection (WSSOD) with point annotations proposes annotating each instance with a single point, minimizing annotation time while preserving localization signals. A Point-to-Box teacher model, trained on a small box-labeled subset, converts these point annotations into pseudo-box labels to train a student detector. Yet medical imagery presents unique challenges, including overlapping anatomy, variable object sizes, and elusive structures, which hinder accurate bounding box inference. To overcome these challenges, we introduce DExTeR (DETR with Experts), a transformer-based Point-to-Box regressor tailored for medical imaging. Built upon Point-DETR, DExTeR encodes single-point annotations as object queries and refines feature extraction with the proposed class-guided deformable attention, which guides attention sampling using point coordinates and class labels to capture class-specific characteristics. To improve discrimination in complex structures, it introduces CLICK-MoE (CLass, Instance, and Common Knowledge Mixture of Experts), which decouples class and instance representations to reduce confusion among adjacent or overlapping instances. Finally, we implement a multi-point training strategy that promotes prediction consistency across different point placements, improving robustness to annotation variability. DExTeR achieves state-of-the-art performance across three datasets spanning different medical domains (endoscopy, chest X-rays, and endoscopic ultrasound), highlighting its potential to reduce annotation costs while maintaining high detection accuracy.
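The multi-point training idea can be illustrated with a small sketch. The abstract does not say how points are sampled or how consistency is measured, so everything below is an assumption made for illustration: points are drawn uniformly inside a ground-truth box, consistency is the mean L1 distance of each predicted box to the mean prediction, and `toy_point_to_box` is a hypothetical stand-in for the teacher, not DExTeR itself.

```python
import random
from statistics import mean
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def sample_points_in_box(box: Box, k: int, rng: random.Random) -> List[Point]:
    """Draw k points uniformly inside a box (assumed sampling scheme)."""
    x0, y0, x1, y1 = box
    return [(rng.uniform(x0, x1), rng.uniform(y0, y1)) for _ in range(k)]


def consistency_loss(pred_boxes: List[Box]) -> float:
    """Mean L1 distance of each predicted box to the mean predicted box.

    This is an assumed form of a consistency penalty: it is zero when all
    point placements yield the same box and grows as predictions diverge.
    """
    mean_box = [mean(coords) for coords in zip(*pred_boxes)]
    return mean(
        sum(abs(c - m) for c, m in zip(box, mean_box)) for box in pred_boxes
    )


def toy_point_to_box(point: Point, half_size: float = 10.0) -> Box:
    """Hypothetical teacher: a fixed-size box centered on the point."""
    x, y = point
    return (x - half_size, y - half_size, x + half_size, y + half_size)


rng = random.Random(0)
gt_box: Box = (20.0, 30.0, 60.0, 80.0)

# Several simulated annotator clicks for the same object instance.
points = sample_points_in_box(gt_box, k=4, rng=rng)
preds = [toy_point_to_box(p) for p in points]
loss = consistency_loss(preds)
print(f"consistency loss over {len(points)} points: {loss:.3f}")
```

The toy teacher simply follows the click, so its predictions disagree and the penalty is positive. A regressor that is robust to point placement would map all four clicks to nearly the same box and drive this term toward zero.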