CAVE: Detecting and Explaining Commonsense Anomalies in Visual Environments

📅 2025-10-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing visual anomaly detection methods are predominantly constrained to industrial defects or synthetic scenarios, failing to generalize to the diversity and unpredictability of real-world anomalies. To address this, we propose CAVE, the first cognition-driven benchmark for real-world visual anomaly detection, grounded in cognitive science principles. CAVE introduces a fine-grained, multi-level annotation schema encompassing anomaly type, severity, and prevalence, enabling open-domain reasoning tasks such as anomaly description, explanation, and justification. Its annotation framework jointly models visual localization and human perceptual mechanisms. We systematically evaluate state-of-the-art vision-language models (VLMs) under diverse prompting strategies. Experiments reveal substantial deficiencies in current VLMs' commonsense anomaly perception and reasoning capabilities, underscoring CAVE's value as a rigorous, challenging benchmark for advancing robust, human-aligned visual anomaly understanding.

📝 Abstract
Humans can naturally identify, reason about, and explain anomalies in their environment. In computer vision, this long-standing challenge remains limited to industrial defects or unrealistic, synthetically generated anomalies, failing to capture the richness and unpredictability of real-world anomalies. In this work, we introduce CAVE, the first benchmark of real-world visual anomalies. CAVE supports three open-ended tasks (anomaly description, explanation, and justification), with fine-grained annotations for visual grounding and for categorizing anomalies by their visual manifestations, complexity, severity, and commonness. These annotations draw inspiration from cognitive science research on how humans identify and resolve anomalies, providing a comprehensive framework for evaluating Vision-Language Models (VLMs) in detecting and understanding anomalies. We show that state-of-the-art VLMs struggle with visual anomaly perception and commonsense reasoning, even with advanced prompting strategies. By offering a realistic and cognitively grounded benchmark, CAVE serves as a valuable resource for advancing research in anomaly detection and commonsense reasoning in VLMs.
Problem

Research questions and friction points this paper is trying to address.

Detecting and explaining real-world visual commonsense anomalies
Evaluating vision-language models' anomaly perception capabilities
Addressing limitations in current synthetic anomaly detection methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Real-world visual anomaly benchmark creation
Cognitive science-inspired fine-grained anomaly annotations
Evaluation framework for vision-language model reasoning