AnomalyClaw: A Universal Visual Anomaly Detection Agent via Tool-Grounded Refutation

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses cross-domain visual anomaly detection, where differences in anomaly definitions, data modalities, and annotation standards hinder model generalization. The authors propose a training-free agent framework that formulates anomaly judgment as a multi-round refutation process: in each round, the agent proposes candidate anomalies and tries to refute each one against normal reference samples, using a library of 13 tools for visual verification, reference parsing, and frozen expert probing. An optional self-evolution extension builds a rulebook online without oracle labels. On the CrossDomainVAD-12 benchmark, the method improves macro-AUROC over single-step inference by 3.52 to 7.93 percentage points across GPT-5.5, Seed2.0-lite, and Qwen3.5-VL-27B, and the self-evolution extension adds a further mean gain of 2.09 percentage points on Qwen3.5-VL-27B.
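To make the reported metric concrete, here is a minimal sketch of how a macro-AUROC and a gain in percentage points could be computed. It assumes that "macro" means an unweighted mean of per-dataset AUROC values; the function names (`auroc`, `macro_auroc`, `gain_pp`) and the toy numbers are illustrative and do not come from the paper.

```python
from typing import Dict, Sequence, Tuple


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the fraction of (anomalous, normal) pairs ranked correctly.

    Labels are 1 for anomalous and 0 for normal; tied scores count as 0.5.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one normal and one anomalous sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def macro_auroc(
    per_dataset: Dict[str, Tuple[Sequence[int], Sequence[float]]]
) -> float:
    """Unweighted mean of per-dataset AUROC (assumed meaning of 'macro')."""
    values = [auroc(labels, scores) for labels, scores in per_dataset.values()]
    return sum(values) / len(values)


def gain_pp(baseline: float, agent: float) -> float:
    """Difference between two AUROC values, in percentage points."""
    return (agent - baseline) * 100.0


# Toy data, invented for illustration only.
toy = {
    "dataset_a": ([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]),
    "dataset_b": ([0, 1], [0.2, 0.9]),
}
toy_macro = macro_auroc(toy)
```

Averaging per dataset rather than pooling all samples keeps a large dataset from dominating the score, which matters when the benchmark mixes domains of very different sizes.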

📝 Abstract

Visual anomaly detection (VAD) is crucial in many real-world fields, such as industrial inspection, medical imaging, infrastructure monitoring, and remote sensing. However, the specific anomaly definitions, data modalities, and annotation standards across different domains make it difficult to transfer single-domain trained VAD models. Vision-language models (VLMs), pre-trained on large-scale cross-domain data, can perform visual perception under task instructions, offering a promising solution for cross-domain VAD. However, single-inference VLM judgments are unreliable, since they rely more on prior knowledge than on normal-sample references or fine-grained feature evidence. We therefore present AnomalyClaw, a training-free VAD agent that turns anomaly judgment into a multi-round refutation process. In each round, the agent proposes candidate anomalies and refutes each against normal-sample references, drawing on a 13-tool library for visual verification, reference parsing, and frozen expert probing. On the CrossDomainVAD-12 benchmark (12 datasets), AnomalyClaw achieves consistent macro-AUROC improvements over single-step direct inference, with +6.23 pp on GPT-5.5, +7.93 pp on Seed2.0-lite, and +3.52 pp on Qwen3.5-VL-27B. We further introduce an optional verbalized self-evolution extension, which builds an online rulebook from internal-branch disagreement without oracle labels. On Qwen3.5-VL-27B, it delivers a +2.09 pp mean gain, comparable to a K = 10 oracle-label supervised baseline (+1.99 pp). These results show that agentic refutation improves the anomaly understanding and reasoning of VLMs, rather than merely aggregating tool outputs.
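The propose-then-refute loop described in the abstract can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the `Candidate` type, the `Proposer` and `Tool` signatures, the stopping rule, and the surviving-fraction score are all hypothetical, and images are stood in for by whitespace-separated strings of feature tokens.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Candidate:
    description: str
    refuted: bool = False


# Hypothetical signatures: a proposer suggests new candidates given the query
# and the history so far; a tool returns True when it refutes a candidate.
Proposer = Callable[[str, List[Candidate]], List[Candidate]]
Tool = Callable[[str, Sequence[str], Candidate], bool]


def refutation_loop(
    query: str,
    references: Sequence[str],
    propose: Proposer,
    tools: Sequence[Tool],
    max_rounds: int = 3,
) -> float:
    """Return an anomaly score in [0, 1]: the fraction of candidates that
    survive refutation (an assumed scoring rule, chosen for simplicity)."""
    history: List[Candidate] = []
    for _ in range(max_rounds):
        new_candidates = propose(query, history)
        if not new_candidates:
            break  # nothing left to test
        for cand in new_candidates:
            # A candidate is refuted if any tool finds it in the normal refs.
            cand.refuted = any(tool(query, references, cand) for tool in tools)
            history.append(cand)
    if not history:
        return 0.0
    survivors = [c for c in history if not c.refuted]
    return len(survivors) / len(history)


def toy_propose(query: str, history: List[Candidate]) -> List[Candidate]:
    """Toy proposer: every query token not yet examined becomes a candidate."""
    seen = {c.description for c in history}
    return [Candidate(tok) for tok in query.split() if tok not in seen]


def toy_reference_tool(
    query: str, references: Sequence[str], cand: Candidate
) -> bool:
    """Toy tool: refute a candidate that also appears in a normal reference."""
    return any(cand.description in ref.split() for ref in references)


score = refutation_loop(
    "scratch dust", ["dust", "dust glare"], toy_propose, [toy_reference_tool]
)
```

In this toy run, `dust` is refuted because it also shows up in a normal reference, while `scratch` survives, so half of the candidates remain. A real agent would replace the toy proposer with a VLM call and the toy tool with the verification, reference-parsing, and expert-probe tools the abstract mentions.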

Problem

Research questions and friction points this paper is trying to address.

visual anomaly detection

cross-domain

vision-language models

anomaly definition

annotation standards

Innovation

Methods, ideas, or system contributions that make the work stand out.

tool-grounded refutation

visual anomaly detection

vision-language models