AG-VAS: Anchor-Guided Zero-Shot Visual Anomaly Segmentation with Large Multimodal Models

📅 2026-03-01
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses zero-shot visual anomaly segmentation, where existing methods struggle to localize anomalies accurately because anomalous concepts are abstract, stable visual prototypes are absent, and semantic and pixel-level representations are poorly aligned. To overcome these limitations, the authors propose AG-VAS, an anchor-guided unified segmentation framework that concretizes abstract anomaly semantics into spatial entities through learnable semantic anchors ([SEG], [NOR], and [ANO]). The framework leverages a Semantic-Pixel Alignment Module (SPAM) and an Anchor-Guided Mask Decoder (AGMD) to model contextual contrasts between normal and anomalous patterns. Additionally, the authors introduce Anomaly-Instruct20K, a structured instruction dataset tailored for fine-tuning large vision-language models. Extensive experiments on six industrial and medical benchmarks demonstrate that AG-VAS significantly improves zero-shot anomaly localization accuracy, consistently outperforming prior methods.

πŸ“ Abstract
Large multimodal models (LMMs) exhibit strong task generalization capabilities, offering new opportunities for zero-shot visual anomaly segmentation (ZSAS). However, existing LMM-based segmentation approaches still face fundamental limitations: anomaly concepts are inherently abstract and context-dependent, lacking stable visual prototypes, and the weak alignment between high-level semantic embeddings and pixel-level spatial features hinders precise anomaly localization. To address these challenges, we present AG-VAS (Anchor-Guided Visual Anomaly Segmentation), a new framework that expands the LMM vocabulary with three learnable semantic anchor tokens ([SEG], [NOR], and [ANO]), establishing a unified anchor-guided segmentation paradigm. Specifically, [SEG] serves as an absolute semantic anchor that translates abstract anomaly semantics into explicit, spatially grounded visual entities (e.g., holes or scratches), while [NOR] and [ANO] act as relative anchors that model the contextual contrast between normal and abnormal patterns across categories. To further enhance cross-modal alignment, we introduce a Semantic-Pixel Alignment Module (SPAM) that aligns language-level semantic embeddings with high-resolution visual features, along with an Anchor-Guided Mask Decoder (AGMD) that performs anchor-conditioned mask prediction for precise anomaly localization. In addition, we curate Anomaly-Instruct20K, a large-scale instruction dataset that organizes anomaly knowledge into structured descriptions of appearance, shape, and spatial attributes, facilitating effective learning and integration of the proposed semantic anchors. Extensive experiments on six industrial and medical benchmarks demonstrate that AG-VAS achieves consistent state-of-the-art performance in the zero-shot setting.
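The abstract describes expanding the LMM vocabulary with three learnable anchor tokens. A minimal sketch of that vocabulary-expansion step is below; the function name, the toy vocabulary, and the mean-initialization heuristic are illustrative assumptions, not details from the paper:

```python
import numpy as np

def add_anchor_tokens(vocab, embeddings, anchors=("[SEG]", "[NOR]", "[ANO]")):
    """Append anchor tokens to `vocab` and grow the embedding matrix.

    Each new row is initialized to the mean of the existing embeddings,
    a common heuristic that keeps freshly added tokens in-distribution
    before fine-tuning turns them into learnable semantic anchors.
    (Hypothetical helper; the paper does not specify its initialization.)
    """
    vocab = dict(vocab)  # avoid mutating the caller's vocabulary
    embeddings = np.asarray(embeddings, dtype=np.float32)
    mean_vec = embeddings.mean(axis=0)  # mean of the original rows
    for tok in anchors:
        if tok not in vocab:
            vocab[tok] = len(vocab)          # next free token id
            embeddings = np.vstack([embeddings, mean_vec])
    return vocab, embeddings

# Toy vocabulary of 5 tokens with 4-dimensional embeddings.
vocab = {"<pad>": 0, "a": 1, "hole": 2, "scratch": 3, "normal": 4}
emb = np.random.randn(5, 4).astype(np.float32)

new_vocab, new_emb = add_anchor_tokens(vocab, emb)
print(new_emb.shape)  # (8, 4): three anchor rows appended
```

In a real LMM pipeline the equivalent operation is typically done through the tokenizer's special-token API plus resizing the model's input embedding layer, after which the three anchor rows are updated by instruction fine-tuning.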
Problem

Research questions and friction points this paper is trying to address.

zero-shot visual anomaly segmentation
large multimodal models
semantic-pixel alignment
anomaly localization
context-dependent anomalies
Innovation

Methods, ideas, or system contributions that make the work stand out.

anchor-guided segmentation
zero-shot visual anomaly segmentation
large multimodal models
semantic-pixel alignment
learnable semantic anchors
Zhen Qu
Institute of Automation, Chinese Academy of Sciences
Xian Tao
Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; Casivision; Weiqiao-UCAS Science and Technology Park
Xiaoyi Bao
Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Dingrong Wang
Rochester Institute of Technology
machine learning, reinforcement learning
ShiChen Qu
Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Zhengtao Zhang
Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; Casivision
Xingang Wang
Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences