🤖 AI Summary
This work addresses zero-shot anomaly detection in scenarios where normal data from the target domain is unavailable, a setting in which existing CLIP-based methods suffer from spatial misalignment and insufficient sensitivity to fine-grained anomalies. To overcome these limitations, the authors propose a decoupled prompting mechanism built upon TIPS, a spatially aware vision-language model: a fixed prompt drives image-level anomaly detection, while a learnable prompt enables pixel-level localization. Global anomaly scores are further refined through local evidence fusion, bridging the distribution gap between global and local features without complex auxiliary modules. Evaluated on seven industrial datasets, the method achieves consistent gains of 1.1–3.9% in image-level detection and 1.5–6.9% in pixel-level localization, outperforming current state-of-the-art approaches.
📝 Abstract
Anomaly detection identifies departures from expected behavior in safety-critical settings. When target-domain normal data are unavailable, zero-shot anomaly detection (ZSAD) leverages vision-language models (VLMs). However, CLIP's coarse image-text alignment limits both localization and detection due to (i) spatial misalignment and (ii) weak sensitivity to fine-grained anomalies; prior work compensates with complex auxiliary modules yet largely overlooks the choice of backbone. We revisit the backbone and use TIPS, a VLM trained with spatially aware objectives. While TIPS alleviates CLIP's issues, it exposes a distributional gap between global and local features. We address this with decoupled prompts (fixed for image-level detection, learnable for pixel-level localization) and by injecting local evidence into the global score. Without CLIP-specific tricks, our TIPS-based pipeline improves image-level performance by 1.1-3.9% and pixel-level by 1.5-6.9% across seven industrial datasets, delivering strong generalization with a lean architecture. Code is available at github.com/AlirezaSalehy/Tipsomaly.
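The decoupled-prompt scoring described in the abstract can be sketched in a few lines. This is a toy illustration with random stand-in features, not the authors' implementation: the embedding dimension, the softmax-over-two-prompts scoring, the max-pooling of the local map, and the fusion weight `alpha` are all assumptions made for the sketch.

```python
import numpy as np

def cosine(a, b):
    """Row-wise cosine similarity between two sets of vectors."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d = 64  # toy embedding dimension

# Stand-ins for TIPS outputs: one global token and a 14x14 grid of patch tokens.
global_feat = rng.normal(size=(1, d))
patch_feats = rng.normal(size=(14 * 14, d))

# Stand-ins for text embeddings of [normal, anomalous] prompt pairs:
# the fixed prompt serves image-level detection, the learnable prompt
# (here just another random matrix) serves pixel-level localization.
fixed_prompts = rng.normal(size=(2, d))
learned_prompts = rng.normal(size=(2, d))

# Image-level score: probability assigned to the "anomalous" fixed prompt.
global_score = softmax(cosine(global_feat, fixed_prompts))[0, 1]

# Pixel-level map: per-patch probability under the learnable prompts.
local_map = softmax(cosine(patch_feats, learned_prompts))[:, 1]

# Fusion: inject the strongest local evidence into the global score.
alpha = 0.5  # assumed fusion weight
image_score = (1 - alpha) * global_score + alpha * local_map.max()
```

In this reading, the fixed prompt keeps image-level detection stable while the learnable prompt adapts to local feature statistics, and the fusion step lets a strong pixel-level response raise an otherwise weak global score.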