Rethinking Transfer Learning for Industrial Inspection: DINOv3 vs. ImageNet Pretraining Across RGB and X-ray Tasks

📅 2026-05-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study compares ConvNeXt backbones pretrained with DINOv3 distillation against ConvNeXt backbones pretrained with supervised ImageNet classification, using the conventional ResNet-50 as a reference baseline. The evaluation covers semantic segmentation, instance segmentation, and object detection on four industrial datasets spanning RGB surface-defect inspection and X-ray defect detection, under both frozen-backbone and full fine-tuning regimes. In frozen transfer, DINOv3 shows no clear advantage. With full fine-tuning on RGB tasks, it converges faster and reaches better final performance. On X-ray tasks, supervised ImageNet pretraining remains more effective under both adaptation strategies. These findings indicate that the modality gap between natural RGB images and X-ray images, together with the adaptation regime, strongly conditions how well a pretraining paradigm transfers, which challenges the assumption that self-supervised methods universally generalize better in industrial settings.

📝 Abstract

Vision foundation models pretrained on web-scale data have recently shown strong transfer capabilities on many downstream tasks, but their effectiveness for industrial visual inspection remains unclear. Industrial data differ substantially from web data and often require fine-grained dense prediction, raising the question of whether modern self-supervised pretraining can improve over the conventional transfer-learning paradigm based on supervised ImageNet initialization. In this work, we compare ConvNeXt backbones pretrained with either supervised ImageNet classification or DINOv3 distillation, and relate them to the conventional ResNet-50 baseline. We evaluate semantic segmentation, instance segmentation, and object detection across four downstream datasets spanning RGB surface-defect inspection and X-ray defect detection. We further study both frozen and fully finetuned adaptation regimes. Our results show that DINOv3 offers no clear advantage in frozen transfer, but provides a stronger initialization after full finetuning on RGB tasks, yielding faster convergence and better final performance. Under X-ray modality shift, however, supervised ImageNet pretraining remains more effective in both frozen and finetuned settings. Overall, our findings suggest that modern vision foundation models are promising for supervised RGB industrial inspection, but their transferability is strongly conditioned by downstream adaptation and target modality.
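
The two adaptation regimes named in the abstract differ only in whether gradients reach the pretrained backbone. The following is a minimal PyTorch sketch of that distinction, not the authors' training code. The helper names `set_adaptation_regime` and `trainable_parameters` are hypothetical, and the tiny convolutional stack is a stand-in for a pretrained backbone, not ConvNeXt, ResNet-50, or DINOv3 weights. Keeping a frozen backbone in eval mode is also an assumption made here, not something the abstract states.

```python
import torch
from torch import nn


def set_adaptation_regime(backbone: nn.Module, regime: str) -> None:
    """Toggle gradient flow through a pretrained backbone.

    'frozen'   -> backbone weights stay fixed; only the task head trains.
    'finetune' -> backbone weights are updated together with the head.
    """
    if regime not in ("frozen", "finetune"):
        raise ValueError(f"unknown regime: {regime!r}")
    trainable = regime == "finetune"
    for param in backbone.parameters():
        param.requires_grad = trainable
    # Assumption: a frozen backbone is kept in eval mode so that its
    # normalization statistics do not drift during training.
    backbone.train(trainable)


def trainable_parameters(model: nn.Module):
    """Return only the parameters the optimizer should update."""
    return [p for p in model.parameters() if p.requires_grad]


# Hypothetical stand-in for a pretrained backbone.
backbone = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.BatchNorm2d(8),
    nn.ReLU(),
)
# Hypothetical dense-prediction head: per-pixel logits for 2 classes.
head = nn.Conv2d(8, 2, kernel_size=1)
model = nn.Sequential(backbone, head)

# Frozen regime: the optimizer only sees the head's weight and bias.
set_adaptation_regime(backbone, "frozen")
optimizer = torch.optim.SGD(trainable_parameters(model), lr=0.01)

# One training step on random data, shaped like a segmentation batch.
images = torch.randn(4, 3, 16, 16)
masks = torch.randint(0, 2, (4, 16, 16))
loss = nn.functional.cross_entropy(model(images), masks)
loss.backward()
optimizer.step()
```

Calling `set_adaptation_regime(backbone, "finetune")` and rebuilding the optimizer from `trainable_parameters(model)` switches the same model to the fully finetuned regime, which makes it straightforward to run both regimes from one initialization.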

Problem

Research questions and friction points this paper is trying to address.

transfer learning

industrial inspection

vision foundation models

modality shift

pretraining

Innovation

Methods, ideas, or system contributions that make the work stand out.

transfer learning

vision foundation models

industrial inspection