AI Summary
This work addresses system-level reasoning failures in autonomous driving caused by semantic anomalies, e.g., incongruous combinations of familiar objects. We propose a semantic anomaly detection method that operates directly on image embeddings from Vision Foundation Models (VFMs). The approach uses instance-segmentation-guided, object-centric embedding matching to compare runtime images against a database of nominal scenarios in which the system is deemed safe, and adds a lightweight filtering step to suppress false positives. Evaluated on CARLA-simulated anomalies, the instance-based variant with filtering achieves detection performance comparable to GPT-4o while also localizing each anomaly in the image. These results point to VFM embeddings as a promising basis for real-time semantic anomaly detection in autonomous systems.
Abstract
Semantic anomalies are contextually invalid or unusual combinations of familiar visual elements that can cause undefined behavior and failures in system-level reasoning for autonomous systems. This work explores semantic anomaly detection by leveraging the semantic priors of state-of-the-art vision foundation models, operating directly on the image. We propose a framework that compares local vision embeddings from runtime images to a database of nominal scenarios in which the autonomous system is deemed safe and performant. In this work, we consider two variants of the proposed framework: one using raw grid-based embeddings, and another leveraging instance segmentation for object-centric representations. To further improve robustness, we introduce a simple filtering mechanism to suppress false positives. Our evaluations on CARLA-simulated anomalies show that the instance-based method with filtering achieves performance comparable to GPT-4o, while providing precise anomaly localization. These results highlight the potential utility of vision embeddings from foundation models for real-time anomaly detection in autonomous systems.
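To make the matching idea concrete, here is a minimal sketch of comparing per-instance embeddings against a nominal database. It is an illustration under stated assumptions, not the authors' implementation. The abstract does not specify the scoring rule or the filter, so the cosine-distance score, the `score_threshold` and `min_area` parameters, and the area-based filter are all hypothetical stand-ins, as is every function name.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def anomaly_score(embedding, nominal_db):
    """Assumed score: 1 minus the best cosine match in the nominal database."""
    if not nominal_db:
        raise ValueError("nominal_db must contain at least one embedding")
    return 1.0 - max(cosine_similarity(embedding, ref) for ref in nominal_db)


def detect_anomalies(instances, nominal_db, score_threshold=0.3, min_area=50):
    """Return (index, score) for instances that look anomalous.

    Each instance is a dict with an 'embedding' (list of floats) and an
    'area' in pixels. The min_area check is a hypothetical false-positive
    filter; the source does not describe its actual filtering rule.
    """
    flagged = []
    for idx, inst in enumerate(instances):
        score = anomaly_score(inst["embedding"], nominal_db)
        if score > score_threshold and inst["area"] >= min_area:
            flagged.append((idx, score))
    return flagged


# Toy data: two nominal embeddings and three runtime instances.
nominal_db = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
instances = [
    {"embedding": [0.9, 0.1, 0.0], "area": 400},  # close to a nominal entry
    {"embedding": [0.0, 0.0, 1.0], "area": 400},  # far from every nominal entry
    {"embedding": [0.0, 0.0, 1.0], "area": 10},   # far, but too small to keep
]
flagged = detect_anomalies(instances, nominal_db)
print(flagged)
```

Because each flagged index refers back to a segmented instance, the instance mask gives the location of the anomaly in the image. In practice the embeddings would come from a vision foundation model pooled over each instance mask, and the linear scan over the database would likely be replaced by an approximate nearest-neighbor index.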