Towards Training-free Anomaly Detection with Vision and Language Foundation Models

📅 2025-03-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses compositional anomalies, that is, abnormal configurations that violate logical constraints among multiple elements, which most anomaly detection approaches neglect in favor of local structural defects. It proposes LogSAD, a training-free multi-modal framework for both logical and structural anomaly detection. LogSAD uses a match-of-thought architecture in which a large multi-modal model (GPT-4V) generates matching proposals that define interests and compositional rules. Detection then runs at multiple granularities (patch tokens, sets of interests, and composition matching), and a calibration module aligns the anomaly scores of the different detectors before they are integrated into a final decision. The authors report state-of-the-art results without any training, even when compared to supervised approaches. Code is publicly available.

📝 Abstract
Anomaly detection is valuable for real-world applications, such as industrial quality inspection. However, most approaches focus on detecting local structural anomalies while neglecting compositional anomalies incorporating logical constraints. In this paper, we introduce LogSAD, a novel multi-modal framework that requires no training for both Logical and Structural Anomaly Detection. First, we propose a match-of-thought architecture that employs advanced large multi-modal models (i.e. GPT-4V) to generate matching proposals, formulating interests and compositional rules of thought for anomaly detection. Second, we elaborate on multi-granularity anomaly detection, consisting of patch tokens, sets of interests, and composition matching with vision and language foundation models. Subsequently, we present a calibration module to align anomaly scores from different detectors, followed by integration strategies for the final decision. Consequently, our approach addresses both logical and structural anomaly detection within a unified framework and achieves state-of-the-art results without the need for training, even when compared to supervised approaches, highlighting its robustness and effectiveness. Code is available at https://github.com/zhang0jhon/LogSAD.
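The calibration and integration step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how LogSAD calibrates scores, so everything here is an assumption: the sketch z-scores each detector's raw output against scores that detector gave to known-normal reference images, then fuses the calibrated values with a max or mean rule. The detector names and all numbers are hypothetical.

```python
from statistics import mean, pstdev


def calibrate(score, normal_scores, eps=1e-8):
    """Z-score a raw anomaly score against the scores the same detector
    produced on known-normal images (an assumed calibration scheme)."""
    mu = mean(normal_scores)
    sigma = pstdev(normal_scores)
    return (score - mu) / (sigma + eps)


def fuse(calibrated_scores, strategy="max"):
    """Combine calibrated scores from several detectors into one decision score."""
    if strategy == "max":
        return max(calibrated_scores)
    if strategy == "mean":
        return mean(calibrated_scores)
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical raw scores on normal reference images, one list per detector.
# Note that the three detectors live on very different scales.
normal_reference = {
    "patch": [0.10, 0.12, 0.11, 0.13],
    "interest_set": [2.0, 2.5, 1.5, 2.0],
    "composition": [0.0, 0.0, 1.0, 0.0],
}

# Hypothetical raw scores for one test image.
test_scores = {"patch": 0.12, "interest_set": 6.0, "composition": 0.0}

calibrated = {
    name: calibrate(test_scores[name], normal_reference[name])
    for name in test_scores
}
final_score = fuse(list(calibrated.values()), strategy="max")
```

After calibration the scores are comparable across detectors, so a single threshold on `final_score` becomes meaningful. Max fusion flags an image when any one detector is confident, while mean fusion needs broader agreement.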
Problem

Research questions and friction points this paper is trying to address.

Detecting both logical and structural anomalies without any training
Handling compositional anomalies that local structural detectors neglect
Aligning anomaly scores from different detectors to reach a final decision
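To make the idea of a compositional (logical) anomaly concrete, here is a small sketch of rule checking. It is not the paper's implementation: it assumes that some upstream model has already produced a list of component labels for an image, and that the compositional rules take the form of allowed count ranges per component. The labels `screw` and `washer` and the function `composition_violations` are hypothetical.

```python
from collections import Counter


def composition_violations(detected_labels, rules):
    """Return a description of every rule the detected components violate.

    rules maps a component label to an inclusive (min, max) count range.
    """
    counts = Counter(detected_labels)
    violations = []
    for label, (lo, hi) in rules.items():
        n = counts.get(label, 0)
        if not lo <= n <= hi:
            violations.append(f"{label}: expected {lo}-{hi}, found {n}")
    return violations


# Hypothetical rules: exactly one screw and exactly two washers per image.
rules = {"screw": (1, 1), "washer": (2, 2)}

normal_image = ["screw", "washer", "washer"]
anomalous_image = ["screw", "screw", "washer", "washer"]  # one extra screw

normal_result = composition_violations(normal_image, rules)
anomalous_result = composition_violations(anomalous_image, rules)
```

In the anomalous image every individual part can look flawless, which is why a detector that only inspects local structure would miss it; the defect exists only at the level of the composition.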
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free multi-modal framework for anomaly detection
Match-of-thought architecture with GPT-4V
Multi-granularity detection with vision-language models
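At the finest granularity, patch tokens, a common training-free approach is to compare each test patch feature with a bank of features taken from normal images and score it by the distance to its nearest neighbor. The sketch below shows that generic technique; whether LogSAD scores patch tokens this way is an assumption, and the two-dimensional feature vectors are toy stand-ins for real foundation-model embeddings.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def patch_anomaly_scores(test_patches, normal_bank):
    """Score each test patch by its distance to the closest normal patch."""
    return [
        min(cosine_distance(patch, ref) for ref in normal_bank)
        for patch in test_patches
    ]


def image_score(patch_scores):
    """An image is as anomalous as its most anomalous patch."""
    return max(patch_scores)


# Toy patch features (hypothetical); real ones would come from a vision encoder.
normal_bank = [[1.0, 0.0], [0.0, 1.0]]
test_patches = [[2.0, 0.0], [1.0, 1.0]]

scores = patch_anomaly_scores(test_patches, normal_bank)
overall = image_score(scores)
```

The first test patch points in the same direction as a normal patch and scores zero, while the second lies between the two normal directions and scores higher. The per-patch scores also double as a coarse localization map.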
Jinjin Zhang
Beihang University
Guodong Wang
Massachusetts College of Liberal Arts
Yizhou Jin
School of Computer Science and Engineering, Beihang University, Beijing 100191, China
Di Huang
State Key Laboratory of Complex and Critical Software Environment, Beihang University, Beijing 100191, China; School of Computer Science and Engineering, Beihang University, Beijing 100191, China