🤖 AI Summary
This work addresses the limited generalizability of conventional semantic segmentation models in semiconductor optical inspection, which are typically tied to specific device types and must be retrained when distribution shifts or unseen devices appear. To overcome this, the authors propose a training-efficient segmentation approach that combines self-supervised pretraining with in-context inference. Using a small industrial dataset, the method pretrains Vision Transformers with Masked Autoencoders (MAE) and introduces a patch-level, retrieval-based segmentation mechanism that needs negligible additional training. Masks are predicted directly from dense encoder embeddings through similarity-based retrieval, which lets the model adapt rapidly to new contexts. Experiments show that, under a fixed fine-tuning computational budget, self-supervised pretraining significantly outperforms both training from scratch and ImageNet-pretrained backbones; notably, when targeting images of a single device, retrieval-based segmentation even surpasses fine-tuning, enabling near-instant adaptation.
📝 Abstract
Segmentation models in automated optical inspection of wire-bonded semiconductors are typically device-specific and must be retrained when new devices or distribution shifts appear. We introduce AOI-SSL, a training-efficient framework for semantic segmentation of wire-bonded semiconductors that combines small-domain self-supervised pre-training of vision transformers with in-context inference, minimizing the need for labeled examples. We pre-train state-of-the-art self-supervised algorithms on a small industrial inspection dataset and find that Masked Autoencoders are the most effective in this small-data setting, improving downstream segmentation while reducing the labeled fine-tuning effort. We further introduce in-context, patch-level retrieval methods that predict masks directly from dense encoder embeddings with negligible additional training. We show that, in this setting, simple similarity-based retrieval performs on par with the more complex attention-based aggregation currently used in the literature. Furthermore, our experiments demonstrate that self-supervised pre-training significantly improves segmentation quality compared to training from scratch and to ImageNet pre-trained backbones under a fixed fine-tuning computational budget. Finally, the results reveal that retrieval-based segmentation outperforms fine-tuning when targeting single-device images, allowing for near-instant adaptation to difficult samples.
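
To make the patch-level retrieval idea concrete, here is a minimal NumPy sketch of similarity-based label retrieval. It is an illustration under stated assumptions, not the AOI-SSL implementation: the abstract does not specify the similarity measure or the aggregation rule, so cosine similarity with a k-nearest-neighbor majority vote is assumed here, and the function names (`retrieve_patch_labels`, `patch_labels_to_mask`) are hypothetical. Patch embeddings are assumed to have been extracted beforehand by a pre-trained encoder and are passed in as plain arrays.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def retrieve_patch_labels(query_emb, support_emb, support_labels, num_classes, k=1):
    """Assign a class to each query patch by voting over its nearest support patches.

    query_emb:      (Nq, D) patch embeddings of the image to segment
    support_emb:    (Ns, D) patch embeddings of the labeled context images
    support_labels: (Ns,)   integer class label of each support patch
    k:              number of neighbors (assumed aggregation rule: majority vote)
    """
    q = l2_normalize(np.asarray(query_emb, dtype=np.float64))
    s = l2_normalize(np.asarray(support_emb, dtype=np.float64))
    support_labels = np.asarray(support_labels)

    sim = q @ s.T                       # (Nq, Ns) cosine similarity matrix
    k = min(k, s.shape[0])
    # Indices of the k most similar support patches for every query patch.
    topk = np.argpartition(-sim, k - 1, axis=1)[:, :k]
    neighbor_labels = support_labels[topk]          # (Nq, k)

    # Count votes per class; np.add.at accumulates correctly on repeated indices.
    votes = np.zeros((q.shape[0], num_classes), dtype=np.int64)
    rows = np.arange(q.shape[0])[:, None]
    np.add.at(votes, (rows, neighbor_labels), 1)
    return votes.argmax(axis=1)


def patch_labels_to_mask(patch_labels, grid_hw, patch_size):
    # Reshape the flat patch labels to the patch grid and upsample each patch
    # to patch_size x patch_size pixels by simple repetition.
    grid = np.asarray(patch_labels).reshape(grid_hw)
    return np.repeat(np.repeat(grid, patch_size, axis=0), patch_size, axis=1)
```

Calling `retrieve_patch_labels` on the embeddings of a new image, with a handful of labeled patches from the same device as the support set, yields per-patch classes that `patch_labels_to_mask` expands into a pixel mask. Because no weights are updated, adapting to a new device in this sketch amounts to swapping the support set.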