Vision Foundation Model Embedding-Based Semantic Anomaly Detection

📅 2025-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses failures in system-level reasoning for autonomous systems caused by semantic anomalies, i.e., contextually invalid or unusual combinations of familiar visual elements. We propose a semantic anomaly detection framework that operates directly on image embeddings from Vision Foundation Models (VFMs): local embeddings from runtime images are compared against a database of nominal scenarios in which the system is deemed safe and performant. Two variants are considered, one using raw grid-based embeddings and one using instance segmentation to obtain object-centric representations, and a simple filtering mechanism suppresses false positives. Evaluated on CARLA-simulated anomalies, the instance-based variant with filtering achieves performance comparable to GPT-4o while also providing precise anomaly localization. These results point to the potential of VFM embeddings for real-time anomaly detection in autonomous systems.

📝 Abstract

Semantic anomalies are contextually invalid or unusual combinations of familiar visual elements that can cause undefined behavior and failures in system-level reasoning for autonomous systems. This work explores semantic anomaly detection by leveraging the semantic priors of state-of-the-art vision foundation models, operating directly on the image. We propose a framework that compares local vision embeddings from runtime images to a database of nominal scenarios in which the autonomous system is deemed safe and performant. In this work, we consider two variants of the proposed framework: one using raw grid-based embeddings, and another leveraging instance segmentation for object-centric representations. To further improve robustness, we introduce a simple filtering mechanism to suppress false positives. Our evaluations on CARLA-simulated anomalies show that the instance-based method with filtering achieves performance comparable to GPT-4o, while providing precise anomaly localization. These results highlight the potential utility of vision embeddings from foundation models for real-time anomaly detection in autonomous systems.
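The comparison of local runtime embeddings against a nominal database can be illustrated with a nearest-neighbor sketch. This is a minimal illustration, not the paper's implementation: the cosine-distance score, the function names (`l2_normalize`, `nearest_nominal_distance`, `flag_anomalies`), and the threshold value are all assumptions introduced here, and the embeddings are toy vectors rather than outputs of a real vision foundation model.

```python
import numpy as np


def l2_normalize(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def nearest_nominal_distance(runtime: np.ndarray, nominal: np.ndarray) -> np.ndarray:
    """Cosine distance from each runtime embedding to its closest nominal one.

    runtime: (N, D) local embeddings from one runtime image.
    nominal: (M, D) local embeddings gathered from nominal scenarios.
    Returns an (N,) array; larger values mean "less like anything nominal".
    """
    sims = l2_normalize(runtime) @ l2_normalize(nominal).T  # (N, M) similarities
    return 1.0 - sims.max(axis=1)


def flag_anomalies(runtime: np.ndarray, nominal: np.ndarray,
                   threshold: float = 0.3) -> np.ndarray:
    """Boolean mask over runtime embeddings; the threshold is a hypothetical value."""
    return nearest_nominal_distance(runtime, nominal) > threshold


if __name__ == "__main__":
    # Toy data: two nominal embeddings and two runtime embeddings.
    nominal = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
    runtime = np.array([[0.9, 0.1, 0.0],   # close to the first nominal vector
                        [0.0, 0.0, 1.0]])  # orthogonal to everything nominal
    print(nearest_nominal_distance(runtime, nominal))
    print(flag_anomalies(runtime, nominal))
```

Because each score is tied to one local embedding (a grid cell or an object instance), a flagged entry also indicates where in the image the anomaly lies, which is the sense in which this kind of scheme supports localization.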

Problem

Research questions and friction points this paper is trying to address.

Detecting semantic anomalies in autonomous systems using vision embeddings

Comparing runtime images to nominal scenarios for anomaly identification

Suppressing false positives with filtering while retaining precise anomaly localization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages vision foundation models for semantic anomaly detection

Compares local embeddings to nominal scenario database

Uses instance segmentation for object-centric embeddings and filtering to suppress false positives
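One way to picture the object-centric variant is to average grid embeddings inside each instance mask and then apply a filter to the resulting detections. The sketch below rests on assumptions: the abstract does not specify how embeddings are pooled or what the filtering rule is, so mean pooling, the minimum-area rule, and the names `pool_instance_embeddings` and `filter_detections` are hypothetical stand-ins chosen only for illustration.

```python
import numpy as np


def pool_instance_embeddings(grid: np.ndarray, masks: np.ndarray):
    """Average the grid embeddings that fall inside each instance mask.

    grid:  (H, W, D) float array of per-cell embeddings.
    masks: (K, H, W) boolean array, one mask per segmented instance.
    Returns (pooled, areas): pooled is (K, D), areas is (K,) cell counts.
    """
    masks_f = masks.astype(grid.dtype)
    areas = masks_f.sum(axis=(1, 2))                   # cells per instance
    summed = np.einsum("khw,hwd->kd", masks_f, grid)   # sum of embeddings per mask
    pooled = summed / np.maximum(areas, 1.0)[:, None]  # guard against empty masks
    return pooled, areas


def filter_detections(scores: np.ndarray, areas: np.ndarray,
                      score_threshold: float = 0.3,
                      min_area: float = 4.0) -> np.ndarray:
    """Keep detections that score above the threshold AND cover enough cells.

    The minimum-area rule is a hypothetical heuristic, not the paper's filter.
    """
    return (scores > score_threshold) & (areas >= min_area)


if __name__ == "__main__":
    # Toy 4x4 grid: top half embeds as [1, 0], bottom half as [0, 1].
    grid = np.zeros((4, 4, 2))
    grid[:2, :, 0] = 1.0
    grid[2:, :, 1] = 1.0

    masks = np.zeros((2, 4, 4), dtype=bool)
    masks[0, :2, :] = True  # large instance covering the top half
    masks[1, 3, 3] = True   # single-cell instance, likely segmentation noise

    pooled, areas = pool_instance_embeddings(grid, masks)
    scores = np.array([0.5, 0.9])  # placeholder anomaly scores per instance
    print(pooled, areas, filter_detections(scores, areas))
```

In this toy setup the single-cell instance is discarded despite its high score, which shows how a cheap post-hoc rule can trade a little recall for fewer false alarms.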

🔎 Similar Papers

Customizing Visual-Language Foundation Models for Multi-modal Anomaly Detection and Reasoning