🤖 AI Summary
This work addresses the challenges of domain-specific semantic deficiency and cross-modal semantic misalignment in few-shot anomaly detection, which commonly arise from reliance on features pretrained on natural scenes and on shallow multimodal fusion. To overcome these limitations, the authors propose VTFusion, a novel framework that incorporates a task-adaptive visual-text feature extraction mechanism and a deep multimodal fusion module to enable fine-grained cross-modal interaction and precise pixel-level anomaly localization. Evaluated under a 2-shot setting, the method achieves image-level AUROC scores of 96.8% on MVTec AD and 86.2% on VisA. Furthermore, on a newly constructed industrial dataset of automotive plastic components, it attains an AUPRO of 93.5%, demonstrating that synthetic anomaly generation sharpens feature discriminability and that the framework generalizes to real-world industrial scenarios.
📝 Abstract
Few-shot anomaly detection (FSAD) has emerged as a critical paradigm for identifying irregularities using scarce normal references. While recent methods have integrated textual semantics to complement visual data, they predominantly rely on features pretrained on natural scenes, thereby neglecting the granular, domain-specific semantics essential for industrial inspection. Furthermore, prevalent fusion strategies often resort to superficial concatenation, failing to address the inherent semantic misalignment between visual and textual modalities, which compromises robustness against cross-modal interference. To bridge these gaps, this study proposes VTFusion, a vision-text multimodal fusion framework tailored for FSAD. The framework rests on two core designs. First, adaptive feature extractors for both image and text modalities are introduced to learn task-specific representations, bridging the domain gap between pretrained models and industrial data; this is further augmented by generating diverse synthetic anomalies to enhance feature discriminability. Second, a dedicated multimodal prediction fusion module is developed, comprising a fusion block that facilitates rich cross-modal information exchange and a segmentation network that generates refined pixel-level anomaly maps under multimodal guidance. VTFusion significantly advances FSAD performance, achieving image-level areas under the receiver operating characteristic curve (AUROCs) of 96.8% and 86.2% in the 2-shot scenario on the MVTec AD and VisA datasets, respectively. Furthermore, VTFusion achieves an area under the per-region overlap curve (AUPRO) of 93.5% on a real-world dataset of industrial automotive plastic parts introduced in this article, further demonstrating its practical applicability in demanding industrial scenarios.
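The abstract describes a fusion block that exchanges information between visual and textual streams before a segmentation network produces pixel-level anomaly maps. The paper's exact architecture is not given here, so the following is only a minimal illustrative sketch of one common way such cross-modal exchange is realized: patch-level visual features attend over text-prompt embeddings via scaled dot-product cross-attention, and the attended text context is added back to the visual stream. All shapes, names, and the residual-fusion choice are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_fusion(visual, text):
    """Illustrative cross-attention fusion (not the paper's exact block).

    visual: (N, d) patch-level visual features.
    text:   (T, d) text-prompt embeddings (e.g., "normal" / "anomalous").
    Returns (N, d) visual features enriched with attended text context.
    """
    d = visual.shape[1]
    scores = visual @ text.T / np.sqrt(d)   # (N, T) visual-to-text similarity
    attn = softmax(scores, axis=-1)         # each patch attends over text tokens
    context = attn @ text                   # (N, d) per-patch text context
    return visual + context                 # residual fusion of the two streams

rng = np.random.default_rng(0)
patches = rng.standard_normal((4, 8))   # 4 image patches, 8-dim features
prompts = rng.standard_normal((2, 8))   # 2 hypothetical text prompts
fused = cross_modal_fusion(patches, prompts)
print(fused.shape)  # (4, 8): one fused feature per patch
```

In a full model, the fused per-patch features would then be decoded by a segmentation head into a pixel-level anomaly map; this sketch only shows the information-exchange step itself.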