FaultGPT: Industrial Fault Diagnosis Question Answering System by Vision Language Models

πŸ“… 2025-02-21
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
To address the limited representational capacity of unimodal data in industrial machinery fault diagnosis, this paper integrates Large Vision-Language Models (LVLMs) into end-to-end Fault Diagnosis Question Answering (FDQA), proposing a cross-modal pipeline: "vibration signal → time-frequency image → natural language report." The authors construct a large-scale instruction-tuning dataset and design a multi-scale cross-modal image decoder, performing instruction tuning without adding trainable parameters to the LVLM, to enhance fine-grained fault semantic understanding. The method eliminates handcrafted feature engineering and supports zero-shot and few-shot transfer. Evaluated on multiple benchmark datasets, it achieves state-of-the-art performance in automated fault report generation, demonstrating strong generalization across diverse machinery conditions and practical applicability in industrial settings.

πŸ“ Abstract
Recently, employing single-modality large language models based on mechanical vibration signals as tuning predictors has introduced new perspectives in intelligent fault diagnosis. However, the potential of these methods to leverage multimodal data remains underexploited, particularly in complex mechanical systems where a single data source often fails to capture comprehensive fault information. In this paper, we present FaultGPT, a novel model that generates fault diagnosis reports directly from raw vibration signals. By leveraging large vision-language models (LVLMs) and text-based supervision, FaultGPT performs end-to-end fault diagnosis question answering (FDQA), distinguishing itself from traditional classification or regression approaches. Specifically, we construct a large-scale FDQA instruction dataset for instruction tuning of the LVLM. This dataset includes vibration time-frequency image-text label pairs and human instruction-ground truth pairs. To enhance the capability to generate high-quality fault diagnosis reports, we design a multi-scale cross-modal image decoder to extract fine-grained fault semantics and conduct instruction tuning without introducing additional training parameters into the LVLM. Extensive experiments, including fault diagnosis report generation and few-shot and zero-shot evaluation across multiple datasets, validate the superior performance and adaptability of FaultGPT in diverse industrial scenarios.
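The first stage of the pipeline described in the abstract converts a raw 1-D vibration signal into a time-frequency image that the LVLM's vision encoder can consume. The sketch below illustrates this step with a plain short-time Fourier transform (STFT); the paper does not specify its exact transform or parameters, so the window length, hop size, and log-scaling here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def time_frequency_image(signal, win_len=256, hop=64):
    """Convert a 1-D vibration signal into a normalized time-frequency
    magnitude image via a short-time Fourier transform (STFT).
    win_len and hop are illustrative values, not the paper's settings."""
    window = np.hanning(win_len)
    n_frames = 1 + (len(signal) - win_len) // hop
    # Slice the signal into overlapping windowed frames
    frames = np.stack([signal[i * hop : i * hop + win_len] * window
                       for i in range(n_frames)])
    # One-sided magnitude spectrum per frame; transpose to (freq, time)
    spec = np.abs(np.fft.rfft(frames, axis=1)).T
    # Log-compress and normalize to [0, 1] so it can be treated as an image
    img = np.log1p(spec)
    return img / img.max()

# Example: a synthetic vibration-like signal with a 50 Hz tone plus noise
fs = 2048
t = np.arange(fs) / fs
sig = np.sin(2 * np.pi * 50 * t) + 0.1 * np.random.randn(fs)
img = time_frequency_image(sig)
print(img.shape)  # → (129, 29): 129 frequency bins × 29 time frames
```

In FaultGPT the resulting image would then be paired with an instruction prompt and fed to the LVLM, whose text decoder produces the diagnosis report.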
Problem

Research questions and friction points this paper is trying to address.

Industrial fault diagnosis using multimodal data
End-to-end fault diagnosis question answering
Enhancing fault report generation with vision-language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages Vision-Language Models
Generates fault diagnosis reports
Multi-scale cross-modal image decoder
πŸ”Ž Similar Papers
No similar papers found.
Jiao Chen
Shien-Ming Wu School of Intelligent Engineering, South China University of Technology, Guangzhou 511442, China
Ruyi Huang
City University of Hong Kong
Fault Diagnosis · PHM · Deep Learning · Transfer Learning · Machine Learning
Zuohong Lv
Shien-Ming Wu School of Intelligent Engineering, South China University of Technology, Guangzhou 511442, China
Jianhua Tang
Shien-Ming Wu School of Intelligent Engineering, South China University of Technology
6G · Edge Computing · Network Slicing · Industrial Internet of Things · Industrial AI
Weihua Li
School of Mechanical and Automotive Engineering, South China University of Technology, Guangzhou 510641, China; Pazhou Lab, Guangzhou 510335, China