SteelDefectX: A Coarse-to-Fine Vision-Language Dataset and Benchmark for Generalizable Steel Surface Defect Detection

📅 2026-03-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limited interpretability and generalization of existing steel surface defect detection methods, which typically train basic image classification models on label-only datasets that carry no semantic context. To bridge this gap, the authors construct SteelDefectX, a vision-language dataset of 7,778 images across 25 defect categories, annotated with coarse-to-fine textual descriptions: coarse-grained, class-level information (defect category, representative visual attributes, associated industrial causes) and fine-grained, sample-specific attributes (shape, size, depth, position, contrast). A benchmark of four tasks (vision-only classification, vision-language classification, few/zero-shot recognition, and zero-shot transfer) evaluates model performance and generalization. Experiments with several baseline models indicate that the coarse-to-fine textual annotations significantly improve interpretability, generalization, and transferability.

📝 Abstract

Steel surface defect detection is essential for ensuring product quality and reliability in modern manufacturing. Current methods often rely on basic image classification models trained on label-only datasets, which limits their interpretability and generalization. To address these challenges, we introduce SteelDefectX, a vision-language dataset containing 7,778 images across 25 defect categories, annotated with coarse-to-fine textual descriptions. At the coarse-grained level, the dataset provides class-level information, including defect categories, representative visual attributes, and associated industrial causes. At the fine-grained level, it captures sample-specific attributes, such as shape, size, depth, position, and contrast, enabling models to learn richer and more detailed defect representations. We further establish a benchmark comprising four tasks (vision-only classification, vision-language classification, few/zero-shot recognition, and zero-shot transfer) to evaluate model performance and generalization. Experiments with several baseline models demonstrate that coarse-to-fine textual annotations significantly improve interpretability, generalization, and transferability. We hope that SteelDefectX will serve as a valuable resource for advancing research on explainable, generalizable steel surface defect detection. The data will be publicly available at https://github.com/Zhaosxian/SteelDefectX.
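
To make the two annotation levels concrete, the following is a minimal Python sketch of how one coarse-to-fine record might be represented and turned into a text description. It is an illustration only: the class names (`CoarseAnnotation`, `FineAnnotation`, `DefectSample`), the `describe` method, the sentence template, the file path, and every example value are assumptions made here, not the released SteelDefectX schema. Only the attribute names (category, visual attributes, industrial causes, shape, size, depth, position, contrast) come from the abstract.

```python
from dataclasses import dataclass


@dataclass
class CoarseAnnotation:
    """Class-level information shared by all images of one defect category."""
    category: str
    visual_attributes: list[str]
    industrial_causes: list[str]


@dataclass
class FineAnnotation:
    """Sample-specific attributes describing a single image."""
    shape: str
    size: str
    depth: str
    position: str
    contrast: str


@dataclass
class DefectSample:
    """One image paired with its coarse and fine annotations (hypothetical layout)."""
    image_path: str
    coarse: CoarseAnnotation
    fine: FineAnnotation

    def describe(self, level: str = "fine") -> str:
        """Build a text description at the requested granularity."""
        if level not in ("coarse", "fine"):
            raise ValueError(f"unknown level: {level!r}")
        coarse_text = (
            f"A steel surface with a {self.coarse.category} defect, "
            f"typically {', '.join(self.coarse.visual_attributes)}, "
            f"often caused by {', '.join(self.coarse.industrial_causes)}."
        )
        if level == "coarse":
            return coarse_text
        fine_text = (
            f"This sample shows a {self.fine.shape}, {self.fine.size}, "
            f"{self.fine.depth} defect at the {self.fine.position} "
            f"with {self.fine.contrast} contrast."
        )
        return f"{coarse_text} {fine_text}"


# Placeholder values invented for illustration; not taken from the dataset.
sample = DefectSample(
    image_path="images/example_0001.jpg",
    coarse=CoarseAnnotation(
        category="scratch",
        visual_attributes=["thin", "linear"],
        industrial_causes=["mechanical abrasion"],
    ),
    fine=FineAnnotation(
        shape="elongated",
        size="small",
        depth="shallow",
        position="center",
        contrast="high",
    ),
)

print(sample.describe("coarse"))
print(sample.describe("fine"))
```

Keeping the class-level and sample-level fields in separate structures mirrors the split the abstract describes, and lets the same record feed either a coarse-only or a coarse-plus-fine text input to a vision-language model.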

Problem

Research questions and friction points this paper is trying to address.

steel surface defect detection

interpretability

generalization

vision-language dataset

defect annotation

Innovation

Methods, ideas, or system contributions that make the work stand out.

vision-language dataset

coarse-to-fine annotation

steel surface defect detection