AG-VAS: Anchor-Guided Zero-Shot Visual Anomaly Segmentation with Large Multimodal Models

📅 2026-03-01
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses zero-shot visual anomaly segmentation, where existing methods struggle to localize anomalies accurately because anomalous concepts are abstract, stable visual prototypes are absent, and semantic and pixel-level representations are poorly aligned. To overcome these limitations, the authors propose AG-VAS, an anchor-guided unified segmentation framework that concretizes abstract anomaly semantics into spatial entities through learnable semantic anchors ([SEG], [NOR], and [ANO]). The framework leverages a Semantic-Pixel Alignment Module (SPAM) and an Anchor-Guided Mask Decoder (AGMD) to model contextual contrasts between normal and anomalous patterns. Additionally, the authors introduce Anomaly-Instruct20K, a structured instruction dataset tailored for fine-tuning large vision-language models. Extensive experiments on six industrial and medical benchmarks demonstrate that AG-VAS significantly improves zero-shot anomaly localization accuracy, consistently outperforming prior methods.

πŸ“ Abstract
Large multimodal models (LMMs) exhibit strong task generalization capabilities, offering new opportunities for zero-shot visual anomaly segmentation (ZSAS). However, existing LMM-based segmentation approaches still face fundamental limitations: anomaly concepts are inherently abstract and context-dependent, lacking stable visual prototypes, and the weak alignment between high-level semantic embeddings and pixel-level spatial features hinders precise anomaly localization. To address these challenges, we present AG-VAS (Anchor-Guided Visual Anomaly Segmentation), a new framework that expands the LMM vocabulary with three learnable semantic anchor tokens ([SEG], [NOR], and [ANO]), establishing a unified anchor-guided segmentation paradigm. Specifically, [SEG] serves as an absolute semantic anchor that translates abstract anomaly semantics into explicit, spatially grounded visual entities (e.g., holes or scratches), while [NOR] and [ANO] act as relative anchors that model the contextual contrast between normal and abnormal patterns across categories. To further enhance cross-modal alignment, we introduce a Semantic-Pixel Alignment Module (SPAM) that aligns language-level semantic embeddings with high-resolution visual features, along with an Anchor-Guided Mask Decoder (AGMD) that performs anchor-conditioned mask prediction for precise anomaly localization. In addition, we curate Anomaly-Instruct20K, a large-scale instruction dataset that organizes anomaly knowledge into structured descriptions of appearance, shape, and spatial attributes, facilitating effective learning and integration of the proposed semantic anchors. Extensive experiments on six industrial and medical benchmarks demonstrate that AG-VAS achieves consistent state-of-the-art performance in the zero-shot setting.
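The abstract describes expanding the LMM vocabulary with three learnable anchor tokens. A minimal sketch of that vocabulary-expansion step is below; the function name, the toy vocabulary, and the mean-initialization heuristic are illustrative assumptions, not details from the paper:

```python
import numpy as np

def add_anchor_tokens(vocab, embeddings, anchors=("[SEG]", "[NOR]", "[ANO]")):
    """Append anchor tokens to `vocab` and grow the embedding matrix.

    Each new row is initialized to the mean of the existing embeddings,
    a common heuristic that keeps freshly added tokens in-distribution
    before fine-tuning turns them into learnable semantic anchors.
    (Hypothetical helper; the paper does not specify its initialization.)
    """
    vocab = dict(vocab)  # avoid mutating the caller's vocabulary
    embeddings = np.asarray(embeddings, dtype=np.float32)
    mean_vec = embeddings.mean(axis=0)  # mean of the original rows
    for tok in anchors:
        if tok not in vocab:
            vocab[tok] = len(vocab)          # next free token id
            embeddings = np.vstack([embeddings, mean_vec])
    return vocab, embeddings

# Toy vocabulary of 5 tokens with 4-dimensional embeddings.
vocab = {"<pad>": 0, "a": 1, "hole": 2, "scratch": 3, "normal": 4}
emb = np.random.randn(5, 4).astype(np.float32)

new_vocab, new_emb = add_anchor_tokens(vocab, emb)
print(new_emb.shape)  # (8, 4): three anchor rows appended
```

In a real LMM pipeline the equivalent operation is typically done through the tokenizer's special-token API plus resizing the model's input embedding layer, after which the three anchor rows are updated by instruction fine-tuning.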
Problem

Research questions and friction points this paper is trying to address.

zero-shot visual anomaly segmentation
large multimodal models
semantic-pixel alignment
anomaly localization
context-dependent anomalies
Innovation

Methods, ideas, or system contributions that make the work stand out.

anchor-guided segmentation
zero-shot visual anomaly segmentation
large multimodal models
semantic-pixel alignment
learnable semantic anchors
Zhen Qu
Institute of Automation, Chinese Academy of Sciences
Xian Tao
Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; Casivision; Weiqiao-UCAS Science and Technology Park
Xiaoyi Bao
Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Dingrong Wang
Rochester Institute of Technology
machine learning, reinforcement learning
ShiChen Qu
Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Zhengtao Zhang
Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; Casivision
Xingang Wang
Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences