🤖 AI Summary
To address the challenges of recognizing rare categories and ambiguous reference information during batch-wise inference in zero-shot anomaly detection (ZSAD), this paper proposes FiSeCLIP—a fine-grained, CLIP-based method requiring no fine-tuning. Its core innovation lies in leveraging unlabeled images within each test batch as mutual references, integrating feature matching with cross-modal alignment, and introducing a text-guided noise filtering mechanism to uncover local semantic correlations for precise anomaly localization. By avoiding explicit training and instead recovering local semantics, FiSeCLIP enhances robustness and discriminative capability. Evaluated on the MVTec-AD benchmark, it significantly outperforms the state-of-the-art AdaCLIP, achieving improvements of +4.6% in AU-ROC and +5.7% in F1-max for segmentation—establishing a stronger, training-free baseline for ZSAD.
📝 Abstract
With the advent of vision-language models (e.g., CLIP) in zero- and few-shot settings, CLIP has been widely applied to zero-shot anomaly detection (ZSAD) in recent research, where rare classes are essential and expected in many applications. This study introduces **FiSeCLIP**, a training-free method built on **CLIP** for ZSAD that combines feature matching with cross-modal alignment. Testing against an entire dataset at once is impractical, whereas batch-based testing better aligns with real industrial needs, and images within a batch can serve as mutual reference points. Accordingly, FiSeCLIP utilizes the other images in the same batch as reference information for the current image. However, the lack of labels for these references can introduce ambiguity, so we apply text information to **fi**lter out noisy features. In addition, we further explore CLIP's inherent potential to restore its local **se**mantic correlation, adapting it for fine-grained anomaly detection tasks and enabling a more accurate filtering process. Our approach exhibits superior performance for both anomaly classification and segmentation on anomaly detection benchmarks, establishing a stronger baseline for this direction: e.g., on MVTec-AD, FiSeCLIP outperforms the SOTA AdaCLIP by +4.6%$\uparrow$/+5.7%$\uparrow$ on the segmentation metrics AU-ROC/$F_1$-max.
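The core idea — score each patch against text prompts, use text scores to filter noisy reference patches drawn from the other images in the batch, then fuse text-based and feature-matching scores — can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the patch and text embeddings are assumed precomputed (e.g., by a CLIP encoder), and the fusion weights, temperature `tau`, and filtering threshold `noise_thresh` are illustrative placeholders.

```python
import numpy as np

def l2norm(x, axis=-1):
    """Normalize vectors to unit length along the given axis."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def batch_reference_scores(patch_feats, text_feats, tau=0.07, noise_thresh=0.5):
    """Hypothetical sketch of batch-wise, text-filtered anomaly scoring.

    patch_feats: (B, P, D) patch embeddings for a batch of B images
    text_feats:  (2, D) embeddings for ["normal", "anomalous"] prompts
    Returns per-patch anomaly scores of shape (B, P).
    """
    F = l2norm(patch_feats)
    T = l2norm(text_feats)

    # Cross-modal score: softmax over similarity to the two text prompts
    logits = F @ T.T / tau                       # (B, P, 2)
    e = np.exp(logits - logits.max(-1, keepdims=True))
    p_anom = e[..., 1] / e.sum(-1)               # (B, P), prob. of "anomalous"

    B, P, D = F.shape
    scores = np.zeros((B, P))
    for i in range(B):
        # Reference pool: patches from the *other* images in the batch,
        # keeping only those the text score deems normal (noise filtering).
        mask = np.ones(B, bool)
        mask[i] = False
        ref = F[mask].reshape(-1, D)
        keep = p_anom[mask].reshape(-1) < noise_thresh
        ref = ref[keep] if keep.any() else ref

        # Feature-matching score: distance to the closest reference patch
        match = 1.0 - (F[i] @ ref.T).max(-1)     # (P,)

        # Fuse feature matching with the cross-modal score (equal weights here)
        scores[i] = 0.5 * match + 0.5 * p_anom[i]
    return scores
```

In this toy form, a patch scores high when it both aligns with the "anomalous" prompt and fails to match any text-filtered reference patch from its batch neighbors; the actual method operates on fine-grained CLIP features with restored local semantic correlation.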