🤖 AI Summary
Large-scale pretraining has not been systematically explored for 3D medical object detection; existing approaches predominantly rely on 2D or natural-image pretraining and thus neglect intrinsic 3D volumetric features.
Method: This work presents the first systematic investigation of large-scale pretraining tailored to 3D medical detection, encompassing both CNN and Transformer architectures under three paradigms: voxel/image reconstruction-based self-supervised learning, supervised learning, and contrastive learning.
Contribution/Results: Self-supervised pretraining via volumetric or slice-wise reconstruction consistently yields substantial gains in detection performance (e.g., mAP), whereas contrastive learning shows no stable improvement. These gains hold across diverse 3D medical benchmarks, including LiTS and BTCV, bridging a critical gap in pretraining research for detection relative to segmentation. All code is publicly released.
📝 Abstract
Large-scale pre-training holds the promise to advance 3D medical object detection, a crucial component of accurate computer-aided diagnosis. Yet, it remains underexplored compared to segmentation, where pre-training has already demonstrated significant benefits. Existing pre-training approaches for 3D object detection rely on 2D medical data or natural image pre-training, failing to fully leverage 3D volumetric information. In this work, we present the first systematic study of how existing pre-training methods can be integrated into state-of-the-art detection architectures, covering both CNNs and Transformers. Our results show that pre-training consistently improves detection performance across various tasks and datasets. Notably, reconstruction-based self-supervised pre-training outperforms supervised pre-training, while contrastive pre-training provides no clear benefit for 3D medical object detection. Our code is publicly available at: https://github.com/MIC-DKFZ/nnDetection-finetuning.
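The reconstruction-based self-supervised paradigm the paper highlights can be illustrated with a minimal sketch: hide a random subset of voxels in a 3D patch, have a model reconstruct the volume, and penalize error only on the hidden voxels. This is an illustrative toy in NumPy, not the paper's implementation; the function names, masking ratio, and zero-fill corruption are assumptions for demonstration.

```python
import numpy as np

def random_voxel_mask(shape, mask_ratio, rng):
    """Boolean mask marking which voxels are hidden during pre-training."""
    n = int(np.prod(shape))
    k = int(n * mask_ratio)
    idx = rng.choice(n, size=k, replace=False)
    mask = np.zeros(n, dtype=bool)
    mask[idx] = True
    return mask.reshape(shape)

def masked_reconstruction_loss(volume, reconstruction, mask):
    """MSE computed only over the masked (hidden) voxels."""
    diff = (volume - reconstruction)[mask]
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
vol = rng.standard_normal((8, 8, 8))           # toy stand-in for a CT patch
mask = random_voxel_mask(vol.shape, 0.6, rng)  # hide 60% of the voxels
corrupted = np.where(mask, 0.0, vol)           # the model's input
# A perfect reconstruction would recover `vol` and drive this loss to zero;
# scoring the corrupted input itself gives a nonzero baseline.
loss = masked_reconstruction_loss(vol, corrupted, mask)
```

In the real setting the reconstruction would come from the detection backbone (CNN or Transformer) fitted with a lightweight decoder, and the pretrained encoder weights would then initialize the detector.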