🤖 AI Summary
This work addresses the scarcity of real anomalous samples in industrial visual inspection, which limits the performance of anomaly detection methods. Existing synthesis approaches often produce poorly integrated anomalies or inaccurate masks. To overcome these issues, the authors propose GroundingAnomaly, a spatially guided diffusion framework that uses per-pixel semantic maps to control the location and morphology of synthesized anomalies. A spatial conditioning module provides pixel-level semantic guidance, and a gated self-attention mechanism injects the conditioning into a frozen U-Net, preserving pretrained priors and supporting few-shot adaptation. The approach generates high-quality anomalous samples on the MVTec AD and VisA benchmarks and achieves state-of-the-art performance in anomaly detection, segmentation, and instance-level detection.
📝 Abstract
The performance of visual anomaly inspection in industrial quality control is often constrained by the scarcity of real anomalous samples. Consequently, anomaly synthesis techniques have been developed to enlarge training sets and enhance downstream inspection. However, existing methods either suffer from poor integration caused by inpainting or fail to provide accurate masks. To address these limitations, we propose GroundingAnomaly, a novel few-shot anomaly image generation framework. Our framework introduces a Spatial Conditioning Module that leverages per-pixel semantic maps to enable precise spatial control over the synthesized anomalies. Furthermore, a Gated Self-Attention Module is designed to inject conditioning tokens into a frozen U-Net via gated attention layers. This design preserves pretrained priors while keeping few-shot adaptation stable. Extensive evaluations on the MVTec AD and VisA datasets demonstrate that GroundingAnomaly generates high-quality anomalies and achieves state-of-the-art performance across multiple downstream tasks, including anomaly detection, segmentation, and instance-level detection.
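The abstract does not spell out how the gated attention layers are formulated. As a rough illustration only, the sketch below assumes a tanh-gated residual, a common choice for adding conditioning to a frozen backbone: the visual tokens and conditioning tokens are concatenated, passed through self-attention, and the result for the visual tokens is added back scaled by `tanh(gate)`. The function names, the single-head attention with identity projections, and the scalar `gate` are all hypothetical simplifications, not the authors' implementation.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head attention with identity Q/K/V projections (a simplification)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out


def gated_self_attention(visual, cond, gate):
    """Assumed form: y = visual + tanh(gate) * SelfAttn([visual; cond])[:len(visual)]."""
    # Attend over visual and conditioning tokens jointly, keep only the visual slots.
    attended = self_attention(visual + cond)[: len(visual)]
    g = math.tanh(gate)
    return [[x + g * a for x, a in zip(x_row, a_row)]
            for x_row, a_row in zip(visual, attended)]


# Hypothetical toy inputs: two visual tokens and one conditioning token.
visual = [[1.0, 0.0], [0.0, 1.0]]
cond = [[0.5, 0.5]]

unchanged = gated_self_attention(visual, cond, gate=0.0)  # gate closed
adapted = gated_self_attention(visual, cond, gate=1.0)    # gate partly open
```

Under this assumption, a gate initialized at zero leaves the frozen features untouched, which is one way such a layer can preserve pretrained priors at the start of few-shot training; the conditioning only takes effect as the gate is learned.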