🤖 AI Summary
To address modality missingness in industrial surface defect detection caused by sensor instability, this paper proposes a robust multimodal fusion framework. Methodologically, it introduces three novel prompting mechanisms—cross-modal consistency prompting, modality-specific prompting, and missingness-aware prompting—and employs symmetric contrastive learning with text as a bridging modality to enable complementary RGB and 3D visual feature modeling. Furthermore, it integrates trimodal contrastive pretraining with adversarial text prompt generation to enhance generalization under modality missingness. Experiments demonstrate that, under a combined RGB+3D missingness rate of 0.7, the framework achieves I-AUROC and P-AUROC scores of 73.83% and 93.05%, respectively—surpassing state-of-the-art methods by 3.84% and 5.58%. It consistently outperforms existing approaches across diverse missingness patterns, establishing new benchmarks for robust multimodal industrial defect detection.
📝 Abstract
Multimodal industrial surface defect detection (MISDD) aims to identify and locate defects in industrial products by fusing RGB and 3D modalities. This article focuses on the modality-missing problem in MISDD caused by uncertain sensor availability. In this context, fusing multiple modalities raises several challenges, including learning-mode transformation and information vacancy. To this end, we first propose cross-modal prompt learning, which includes: i) a cross-modal consistency prompt that establishes information consistency between the dual visual modalities; ii) a modality-specific prompt inserted to adapt to different input patterns; and iii) a missing-aware prompt attached to compensate for the information vacancy caused by dynamically missing modalities. In addition, we propose symmetric contrastive learning, which uses the text modality as a bridge for fusing the dual vision modalities. Specifically, a paired antithetical text prompt is designed to generate binary text semantics, and triple-modal contrastive pre-training is introduced to accomplish multimodal learning. Experimental results show that our proposed method achieves 73.83% I-AUROC and 93.05% P-AUROC at a total missing rate of 0.7 for the RGB and 3D modalities (exceeding state-of-the-art methods by 3.84% and 5.58%, respectively), and outperforms existing approaches to varying degrees under different missing types and rates. The source code will be available at https://github.com/SvyJ/MISDD-MM.
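The text-bridged symmetric contrastive objective described in the abstract can be sketched as follows. This is a minimal illustration, assuming a standard symmetric InfoNCE loss applied between each visual modality (RGB, 3D) and a shared text embedding; the function names, embedding sizes, and temperature are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches of L2-normalized
    embeddings a, b of shape (N, D); row i of a matches row i of b."""
    logits = a @ b.T / temperature
    labels = np.arange(len(a))

    def xent(l):
        # Cross-entropy with the matched row as the positive class.
        l = l - l.max(axis=1, keepdims=True)          # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average both directions (a->b and b->a), hence "symmetric".
    return 0.5 * (xent(logits) + xent(logits.T))

def trimodal_loss(rgb, pc, txt):
    """Text acts as the bridge: each visual modality is aligned to the
    text embedding rather than directly to the other visual modality."""
    return info_nce(rgb, txt) + info_nce(pc, txt)

def l2norm(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Toy batch of 4 samples with 8-dim embeddings (illustrative only).
rng = np.random.default_rng(0)
rgb = l2norm(rng.standard_normal((4, 8)))   # RGB encoder output
pc  = l2norm(rng.standard_normal((4, 8)))   # 3D (point cloud) encoder output
txt = l2norm(rng.standard_normal((4, 8)))   # text encoder output
loss = trimodal_loss(rgb, pc, txt)
```

Because both visual modalities are pulled toward the same text anchor, their representations remain comparable even when one of them is missing at test time, which is the intuition behind using text as the bridging modality.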