🤖 AI Summary
To address the challenges of recognizing rare categories and ambiguous reference information during batch-wise inference in zero-shot anomaly detection (ZSAD), this paper proposes FiSeCLIP—a fine-grained, CLIP-based method requiring no fine-tuning. Its core innovation lies in leveraging unlabeled images within each test batch as mutual references, integrating feature matching with cross-modal alignment, and introducing a text-guided noise filtering mechanism to uncover local semantic correlations for precise anomaly localization. By avoiding explicit training and instead recovering local semantics, FiSeCLIP enhances robustness and discriminative capability. Evaluated on the MVTec-AD benchmark, it significantly outperforms the state-of-the-art AdaCLIP, achieving improvements of +4.6% in AU-ROC and +5.7% in F1-max for segmentation—establishing a stronger, training-free baseline for ZSAD.
📝 Abstract
With the advent of vision-language models (e.g., CLIP) in zero- and few-shot settings, CLIP has been widely applied to zero-shot anomaly detection (ZSAD) in recent research, where rare classes are essential and expected in many applications. This study introduces **FiSeCLIP**, a training-free method built on **CLIP** for ZSAD that combines feature matching with cross-modal alignment. Testing against an entire dataset at once is impractical, whereas batch-based testing better aligns with real industrial needs, and images within a batch can serve as mutual reference points. Accordingly, FiSeCLIP utilizes the other images in the same batch as reference information for the current image. However, the lack of labels for these references can introduce ambiguity, so we apply text information to **fi**lter out noisy features. In addition, we further explore CLIP's inherent potential to restore its local **se**mantic correlation, adapting it for fine-grained anomaly detection tasks and enabling a more accurate filtering process. Our approach exhibits superior performance for both anomaly classification and segmentation on anomaly detection benchmarks, establishing a stronger baseline for this direction: e.g., on MVTec-AD, FiSeCLIP outperforms the SOTA AdaCLIP by +4.6%$\uparrow$/+5.7%$\uparrow$ on the segmentation metrics AU-ROC/$F_1$-max.
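The core idea — score each patch against text prompts, use text scores to filter noisy reference patches drawn from the other images in the batch, then fuse text-based and feature-matching scores — can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the patch and text embeddings are assumed precomputed (e.g., by a CLIP encoder), and the fusion weights, temperature `tau`, and filtering threshold `noise_thresh` are illustrative placeholders.

```python
import numpy as np

def l2norm(x, axis=-1):
    """Normalize vectors to unit length along the given axis."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def batch_reference_scores(patch_feats, text_feats, tau=0.07, noise_thresh=0.5):
    """Hypothetical sketch of batch-wise, text-filtered anomaly scoring.

    patch_feats: (B, P, D) patch embeddings for a batch of B images
    text_feats:  (2, D) embeddings for ["normal", "anomalous"] prompts
    Returns per-patch anomaly scores of shape (B, P).
    """
    F = l2norm(patch_feats)
    T = l2norm(text_feats)

    # Cross-modal score: softmax over similarity to the two text prompts
    logits = F @ T.T / tau                       # (B, P, 2)
    e = np.exp(logits - logits.max(-1, keepdims=True))
    p_anom = e[..., 1] / e.sum(-1)               # (B, P), prob. of "anomalous"

    B, P, D = F.shape
    scores = np.zeros((B, P))
    for i in range(B):
        # Reference pool: patches from the *other* images in the batch,
        # keeping only those the text score deems normal (noise filtering).
        mask = np.ones(B, bool)
        mask[i] = False
        ref = F[mask].reshape(-1, D)
        keep = p_anom[mask].reshape(-1) < noise_thresh
        ref = ref[keep] if keep.any() else ref

        # Feature-matching score: distance to the closest reference patch
        match = 1.0 - (F[i] @ ref.T).max(-1)     # (P,)

        # Fuse feature matching with the cross-modal score (equal weights here)
        scores[i] = 0.5 * match + 0.5 * p_anom[i]
    return scores
```

In this toy form, a patch scores high when it both aligns with the "anomalous" prompt and fails to match any text-filtered reference patch from its batch neighbors; the actual method operates on fine-grained CLIP features with restored local semantic correlation.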