FB-CLIP: Fine-Grained Zero-Shot Anomaly Detection with Foreground-Background Disentanglement

📅 2026-03-19
🤖 AI Summary
This work addresses the challenges of inaccurate anomaly localization in fine-grained zero-shot anomaly detection, which stem from entangled foreground-background features and coarse textual semantics. To mitigate these issues, the authors propose a multi-strategy textual representation framework coupled with a foreground-background decoupling mechanism. Specifically, they integrate End-of-Text features, global pooling, and attention-weighted text embeddings, while introducing multi-perspective soft disentanglement across identity, semantic, and spatial dimensions. This approach incorporates background suppression and semantic consistency regularization to effectively separate foreground and background features, thereby enhancing the discriminability between normal and anomalous semantic prototypes. Evaluated under a zero-shot setting, the method achieves significant improvements in both anomaly discrimination and localization accuracy in complex scenes, outperforming current state-of-the-art approaches.
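The role of the normal and anomalous semantic prototypes can be illustrated with a minimal sketch of patch-level scoring as it is commonly done in CLIP-based zero-shot anomaly detection. This is not FB-CLIP's actual implementation: the function names, the two-way softmax, and the temperature of 0.07 are assumptions chosen for illustration, and the feature vectors are plain Python lists standing in for real CLIP embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def anomaly_map(patch_feats, normal_proto, abnormal_proto, temperature=0.07):
    """Return one anomaly probability per image patch.

    Assumption: each patch is scored by a two-way softmax over its
    similarity to a "normal" and an "abnormal" text prototype.
    The temperature value is illustrative, not taken from the paper.
    """
    scores = []
    for patch in patch_feats:
        s_normal = cosine(patch, normal_proto) / temperature
        s_abnormal = cosine(patch, abnormal_proto) / temperature
        # Subtract the max before exponentiating for numerical stability.
        m = max(s_normal, s_abnormal)
        e_normal = math.exp(s_normal - m)
        e_abnormal = math.exp(s_abnormal - m)
        scores.append(e_abnormal / (e_normal + e_abnormal))
    return scores
```

Under this sketch, a patch that sits equally close to both prototypes receives a score of 0.5, which is the kind of uncertain match the summary says the regularization is meant to suppress.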

📝 Abstract
Fine-grained anomaly detection is crucial in industrial and medical applications, but labeled anomalies are often scarce, making zero-shot detection challenging. While vision-language models like CLIP offer promising solutions, they struggle with foreground-background feature entanglement and coarse textual semantics. We propose FB-CLIP, a framework that enhances anomaly localization via multi-strategy textual representations and foreground-background separation. In the textual modality, it combines End-of-Text features, global-pooled representations, and attention-weighted token features for richer semantic cues. In the visual modality, multi-view soft separation along identity, semantic, and spatial dimensions, together with background suppression, reduces interference and improves discriminability. Semantic Consistency Regularization (SCR) aligns image features with normal and abnormal textual prototypes, suppressing uncertain matches and enlarging semantic gaps. Experiments show that FB-CLIP effectively distinguishes anomalies from complex backgrounds, achieving accurate fine-grained anomaly detection and localization under zero-shot settings.
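The three textual views named in the abstract (End-of-Text feature, global-pooled representation, attention-weighted token features) can be sketched as follows. The abstract does not say how the views are combined, so the equal-weight average used here is an assumption, as are the function names and the externally supplied `attn_scores`; treat this as a hedged illustration rather than the authors' method.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_text_features(token_feats, eot_index, attn_scores):
    """Combine three views of a prompt's token features into one embedding.

    token_feats: list of per-token feature vectors from a text encoder.
    eot_index:   position of the End-of-Text token.
    attn_scores: one unnormalized relevance score per token (hypothetical input).
    """
    dim = len(token_feats[0])
    n_tokens = len(token_feats)

    # View 1: the End-of-Text token feature.
    eot = token_feats[eot_index]

    # View 2: global (mean) pooling over all tokens.
    pooled = [sum(tok[d] for tok in token_feats) / n_tokens for d in range(dim)]

    # View 3: attention-weighted sum of token features.
    weights = softmax(attn_scores)
    attended = [
        sum(w * tok[d] for w, tok in zip(weights, token_feats)) for d in range(dim)
    ]

    # Assumption: normalize each view, then average with equal weights.
    views = [l2_normalize(v) for v in (eot, pooled, attended)]
    fused = [sum(v[d] for v in views) / len(views) for d in range(dim)]
    return l2_normalize(fused)
```

Normalizing each view before averaging keeps any single view from dominating purely because of its magnitude; learned fusion weights would be an equally plausible design.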
Problem

Research questions and friction points this paper is trying to address.

fine-grained anomaly detection
zero-shot learning
foreground-background disentanglement
vision-language models
anomaly localization
Innovation

Methods, ideas, or system contributions that make the work stand out.

foreground-background disentanglement
zero-shot anomaly detection
vision-language model
semantic consistency regularization
fine-grained localization
Ming Hu
Xi’an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences
Yongsheng Huo
Xi’an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences
Mingyu Dou
Xi’an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences
Jianfu Yin
Xi’an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences
Peng Zhao
Xi’an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences
Yao Wang
Xi'an Jiaotong University
Machine Learning, Signal Processing, Operations Management, Nonconvex Optimization
Cong Hu
Jiangnan University
deep learning, machine learning, computer vision, pattern recognition
Bingliang Hu
Xi’an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences
Quan Wang
Xi’an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences