Resilient Multimodal Industrial Surface Defect Detection with Uncertain Sensors Availability

📅 2025-09-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address modality missingness in industrial surface defect detection caused by sensor instability, this paper proposes a robust multimodal fusion framework. Methodologically, it introduces three novel prompting mechanisms—cross-modal consistency prompting, modality-specific prompting, and missingness-aware prompting—and employs symmetric contrastive learning with text as a bridging modality to enable complementary RGB and 3D visual feature modeling. Furthermore, it integrates trimodal contrastive pretraining with adversarial text prompt generation to enhance generalization under modality missingness. Experiments demonstrate that, under a combined RGB+3D missingness rate of 0.7, the framework achieves I-AUROC and P-AUROC scores of 73.83% and 93.05%, respectively—surpassing state-of-the-art methods by 3.84% and 5.58%. It consistently outperforms existing approaches across diverse missingness patterns, establishing new benchmarks for robust multimodal industrial defect detection.

📝 Abstract
Multimodal industrial surface defect detection (MISDD) aims to identify and locate defects in industrial products by fusing RGB and 3D modalities. This article focuses on the modality-missing problem caused by uncertain sensor availability in MISDD. In this context, the fusion of multiple modalities faces several difficulties, including learning-mode transformation and information vacancy. To this end, we first propose cross-modal prompt learning, which includes: i) a cross-modal consistency prompt that establishes information consistency between the dual visual modalities; ii) a modality-specific prompt inserted to adapt to different input patterns; and iii) a missing-aware prompt attached to compensate for the information vacancy caused by dynamically missing modalities. In addition, we propose symmetric contrastive learning, which uses the text modality as a bridge for fusing the dual vision modalities. Specifically, a paired antithetical text prompt is designed to generate binary text semantics, and triple-modal contrastive pre-training is offered to accomplish multimodal learning. Experimental results show that our proposed method achieves 73.83% I-AUROC and 93.05% P-AUROC at a total missing rate of 0.7 for the RGB and 3D modalities (exceeding state-of-the-art methods by 3.84% and 5.58%, respectively), and outperforms existing approaches to varying degrees under different missing types and rates. The source code will be available at https://github.com/SvyJ/MISDD-MM.
Problem

Research questions and friction points this paper is trying to address.

Addressing modality-missing issues in industrial defect detection
Fusing RGB and 3D data with uncertain sensor availability
Compensating information loss from dynamic missing modalities
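The missing-modality setting the paper evaluates (e.g., a total missing rate of 0.7 across RGB and 3D) can be simulated with a small sketch. This is one plausible interpretation of the benchmark protocol, not the paper's exact sampling scheme; the function name and the "keep at least one modality" rule are assumptions for illustration:

```python
import numpy as np

def sample_missing_masks(n, p_missing, rng):
    """Hypothetical missing-modality simulator: drop each visual modality
    independently with probability `p_missing`, but guarantee every sample
    keeps at least one of RGB / 3D. Returns boolean masks
    (True = modality available)."""
    rgb_ok = rng.random(n) >= p_missing
    d3_ok = rng.random(n) >= p_missing
    both_gone = ~rgb_ok & ~d3_ok
    # where both would be missing, randomly restore one modality
    restore_rgb = rng.random(n) < 0.5
    rgb_ok |= both_gone & restore_rgb
    d3_ok |= both_gone & ~restore_rgb
    return rgb_ok, d3_ok

rng = np.random.default_rng(42)
rgb_ok, d3_ok = sample_missing_masks(1000, 0.7, rng)
print(f"RGB available: {rgb_ok.mean():.2f}, 3D available: {d3_ok.mean():.2f}")
```

A model trained under such masks must handle batches where either visual stream is absent, which is the gap the missing-aware prompt is meant to fill.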
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-modal prompt learning for consistency
Symmetric contrastive learning with text bridge
Missing-aware prompt compensates information vacancy
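The idea of using text as a bridge between the two visual modalities can be illustrated with a minimal sketch of a symmetric trimodal contrastive objective. This is a generic InfoNCE formulation under assumed embedding shapes and temperature, not the paper's exact loss:

```python
import numpy as np

def info_nce(a, b, tau=0.07):
    """Symmetric InfoNCE between two batches of embeddings; matched pairs
    share the same row index (diagonal of the similarity matrix)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / tau
    labels = np.arange(len(a))

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)          # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()           # diagonal = positives

    # average the two retrieval directions (a -> b and b -> a)
    return 0.5 * (xent(logits) + xent(logits.T))

def trimodal_loss(rgb_emb, d3_emb, text_emb):
    """Text acts as a bridge: both visual modalities are pulled toward the
    shared text embedding, aligning them with each other indirectly."""
    return info_nce(rgb_emb, text_emb) + info_nce(d3_emb, text_emb)

rng = np.random.default_rng(0)
rgb, d3, txt = (rng.normal(size=(8, 32)) for _ in range(3))
loss = trimodal_loss(rgb, d3, txt)
print(f"trimodal contrastive loss: {loss:.4f}")
```

Because neither visual modality contrasts directly against the other, the objective remains well-defined when one of RGB or 3D is missing for a sample, which matches the robustness goal of the paper.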
Authors
Shuai Jiang (Google)
Yunfeng Ma, Jingyu Zhou, Yuan Bian, Yaonan Wang, Min Liu (School of Artificial Intelligence and Robotics and the National Engineering Research Center for Robot Visual Perception and Control Technology, Hunan University, Changsha, Hunan 410082, China)