Customizing Visual-Language Foundation Models for Multi-modal Anomaly Detection and Reasoning

📅 2024-03-17
🏛️ arXiv.org
📈 Citations: 10
Influential: 1
🤖 AI Summary
To address the limited generalization capability and poor scene adaptability of existing industrial anomaly detection methods, this paper proposes a framework for customizing vision-language foundation models for multi-modal industrial data. Methodologically, it introduces an anomaly-aware customization paradigm for vision-language models; designs a multi-type expert-guided prompting mechanism that integrates task descriptions, class context, normality rules, and reference images; and unifies heterogeneous modalities—including images, point clouds, and videos—into a 2D image representation, enabling joint anomaly detection and reasoning with a single encoder. Preliminary studies show that conditioning the models on combined visual and language prompts improves anomaly detection, and qualitative case studies highlight reasoning capabilities in multi-object scenes and temporal data. The open-sourced code supports reproducibility and practical use.

📝 Abstract
Anomaly detection is vital in various industrial scenarios, including the identification of unusual patterns in production lines and the detection of manufacturing defects for quality control. Existing techniques tend to be specialized in individual scenarios and lack generalization capacities. In this study, our objective is to develop a generic anomaly detection model that can be applied in multiple scenarios. To achieve this, we custom-build generic visual language foundation models that possess extensive knowledge and robust reasoning abilities as anomaly detectors and reasoners. Specifically, we introduce a multi-modal prompting strategy that incorporates domain knowledge from experts as conditions to guide the models. Our approach considers diverse prompt types, including task descriptions, class context, normality rules, and reference images. In addition, we unify the input representation of multi-modality into a 2D image format, enabling multi-modal anomaly detection and reasoning. Our preliminary studies demonstrate that combining visual and language prompts as conditions for customizing the models enhances anomaly detection performance. The customized models showcase the ability to detect anomalies across different data modalities such as images, point clouds, and videos. Qualitative case studies further highlight the anomaly detection and reasoning capabilities, particularly for multi-object scenes and temporal data. Our code is publicly available at https://github.com/Xiaohao-Xu/Customizable-VLM
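The abstract describes conditioning a vision-language model on multiple prompt types: a task description, class context, normality rules, and a reference image alongside the query. A minimal sketch of how such a multi-modal prompt might be assembled is shown below; the function name, payload fields, and example values are illustrative assumptions, not the authors' actual API.

```python
# Sketch (assumed, not the paper's implementation) of the multi-type
# prompting strategy: expert domain knowledge is folded into a single
# text-plus-images prompt that conditions a generic VLM as an anomaly
# detector and reasoner.

def build_anomaly_prompt(task, class_context, normality_rules,
                         reference_image, query_image):
    """Compose task description, class context, normality rules, and a
    normal reference image into one multi-modal prompt payload."""
    text = "\n".join([
        f"Task: {task}",
        f"Object class: {class_context}",
        "Normality rules:",
        *[f"  - {rule}" for rule in normality_rules],
        "The first image is a normal reference; the second is the query.",
        "Decide whether the query image is anomalous and explain why.",
    ])
    # Generic chat-style payload; real VLM APIs differ in field names.
    return {"text": text, "images": [reference_image, query_image]}

prompt = build_anomaly_prompt(
    task="industrial anomaly detection",
    class_context="metal nut on a production line",
    normality_rules=["no scratches on the top surface",
                     "all threads intact"],
    reference_image="ref_normal.png",
    query_image="query.png",
)
print(prompt["text"])
```

The point is that all expert knowledge enters as conditions on a frozen generic model, rather than through scenario-specific training.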
Problem

Research questions and friction points this paper is trying to address.

- Develop a generic anomaly detection model applicable across multiple industrial scenarios
- Customize visual-language foundation models for multi-modal anomaly detection and reasoning
- Enhance detection performance through multi-modal prompting with expert domain knowledge
Innovation

Methods, ideas, or system contributions that make the work stand out.

- Customized visual-language foundation models as anomaly detectors and reasoners
- Multi-modal prompting that encodes expert domain knowledge
- Unified 2D image input representation for multi-modality processing
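The unified 2D representation means each modality is first rendered as an image so one encoder can handle all of them. A toy sketch of two such conversions is given below; the grid resolution, projection, and frame-sampling scheme are illustrative assumptions, not the paper's actual preprocessing.

```python
# Illustrative sketch (assumed, not the paper's code) of mapping
# heterogeneous modalities into 2D images: a point cloud becomes a
# top-down depth map, and a video becomes a few evenly spaced key
# frames, so a single 2D image encoder can consume all inputs.

def point_cloud_to_depth_map(points, size=8):
    """Orthographic top-down projection: (x, y) -> pixel, z -> depth."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    depth = [[0.0] * size for _ in range(size)]
    for x, y, z in points:
        # Map coordinates into the grid, keeping the highest surface.
        col = min(int((x - x_min) / (x_max - x_min + 1e-9) * size), size - 1)
        row = min(int((y - y_min) / (y_max - y_min + 1e-9) * size), size - 1)
        depth[row][col] = max(depth[row][col], z)
    return depth

def sample_key_frames(num_frames, k=4):
    """Pick k evenly spaced frame indices to fit temporal data
    into a fixed image budget."""
    step = max(num_frames // k, 1)
    return list(range(0, num_frames, step))[:k]

pts = [(0.0, 0.0, 1.0), (1.0, 1.0, 2.0), (0.5, 0.5, 3.0)]
depth_map = point_cloud_to_depth_map(pts, size=4)
print(sample_key_frames(10, k=4))  # → [0, 2, 4, 6]
```

Once every modality is an image (or a small set of images), the same prompt-conditioned encoder handles cross-modal detection without modality-specific branches.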