🤖 AI Summary
Microelectronic defect detection faces two major challenges: severe scarcity of labeled data and the ineffectiveness of transfer learning from natural-image pretraining. Method: This paper proposes the first masked autoencoder (MAE)-driven Vision Transformer (ViT) self-supervised pretraining framework specifically designed for scanning acoustic microscopy (SAM) images. It eliminates reliance on large-scale natural-image datasets and manual annotations, enabling end-to-end learning of defect-specific representations directly in the target domain. Contribution/Results: Trained on fewer than 10,000 SAM images, our model surpasses supervised ViTs, natural-image-pretrained ViTs, and mainstream CNNs in localization accuracy and robustness, particularly for critical defects such as solder joint cracks. Interpretability analysis via attention visualization confirms that the model reliably focuses on genuine defect regions. This work establishes a generalizable self-supervised paradigm for few-shot and cross-domain industrial vision tasks.
📝 Abstract
Whereas transformer-based architectures have quickly become the gold standard in general computer vision, microelectronics defect detection still relies heavily on convolutional neural networks (CNNs). We hypothesize that this is because a) transformers require more training data and b) labelled image generation procedures for microelectronics are costly, so labelled data is scarce. Whereas in other domains pre-training on large natural image datasets can mitigate this problem, in microelectronics transfer learning is hindered by the dissimilarity between domain data and natural images. Therefore, we evaluate self pre-training, where models are pre-trained on the target dataset rather than on another dataset. We propose a vision transformer (ViT) pre-training framework for defect detection in microelectronics based on masked autoencoders (MAE). In MAE, a large share of image patches is masked and reconstructed by the model during pre-training. We perform pre-training and defect detection using a dataset of fewer than 10,000 scanning acoustic microscopy (SAM) images labelled using transient thermal analysis (TTA). Our experimental results show that our approach leads to substantial performance gains compared to a) supervised ViT, b) ViT pre-trained on natural image datasets, and c) state-of-the-art CNN-based defect detection models used in the literature. Additionally, interpretability analysis reveals that our self pre-trained models, in comparison to ViT baselines, correctly focus on defect-relevant features such as cracks in the solder material. This demonstrates that our approach yields fault-specific feature representations, making our self pre-trained models viable for real-world defect detection in microelectronics.
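To make the MAE pre-training objective concrete, the following is a minimal NumPy sketch of the masking and reconstruction-loss setup described above. It is illustrative only: the patch size, image size, 75% mask ratio, and the mean-of-visible-patches "decoder" are assumptions standing in for the paper's actual architecture (a real MAE encodes visible patches with a ViT and reconstructs masked ones with a lightweight transformer decoder).

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify(img, p):
    """Split an (H, W) image into non-overlapping p x p patches, flattened."""
    H, W = img.shape
    return img.reshape(H // p, p, W // p, p).swapaxes(1, 2).reshape(-1, p * p)

# Toy 224x224 "SAM image" with 16x16 patches (random values are stand-ins).
img = rng.random((224, 224))
patches = patchify(img, 16)            # shape (196, 256)
n = len(patches)

# MAE masks a large share of patches (75% here, a common choice);
# the encoder only ever sees the visible subset.
mask_ratio = 0.75
perm = rng.permutation(n)
n_keep = int(n * (1 - mask_ratio))
visible_idx, masked_idx = perm[:n_keep], perm[n_keep:]

# Placeholder "decoder": predict the mean visible patch for every masked
# patch. In the real framework this is a learned transformer decoder.
pred = np.tile(patches[visible_idx].mean(axis=0), (len(masked_idx), 1))

# The MAE loss is mean squared error computed on masked patches only,
# so the model cannot score by trivially copying visible inputs.
loss = np.mean((pred - patches[masked_idx]) ** 2)
print(f"visible: {n_keep}, masked: {len(masked_idx)}, loss: {loss:.4f}")
```

The key design point this sketch captures is that the loss is restricted to masked patches, which forces the model to infer plausible content (e.g. solder structure) from limited context rather than reproduce what it already sees.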