🤖 AI Summary
To address the out-of-distribution (OOD) detection challenge in autonomous driving vision systems, exacerbated by semantic and covariate shifts in open-world environments, this paper proposes the first unsupervised, model-agnostic input monitoring framework tailored for autonomous driving. Leveraging robust representations from vision foundation models (e.g., ViT-MAE, DINOv2), the method unifies four unsupervised density modeling techniques (KDE, GMM, normalizing flows, and VAE-based methods) to estimate feature-space densities without requiring OOD samples or downstream task fine-tuning. Key contributions include: (i) the first systematic evaluation of vision foundation models' generalizability for OOD detection in autonomous driving; (ii) an empirical finding that model capacity, not latent dimensionality, is the dominant factor governing detection performance; and (iii) consistent superiority over 20 state-of-the-art baselines, with average AUC improvements of 12.3%, and reliable identification of high-risk misclassifications, demonstrating its viability as a safety-critical monitoring module.
📄 Abstract
Deep neural networks (DNNs) remain challenged by distribution shifts in complex open-world domains like automated driving (AD): Absolute robustness against yet unknown novel objects (semantic shift) or styles like lighting conditions (covariate shift) cannot be guaranteed. Hence, reliable operation-time monitors for identification of out-of-training-data-distribution (OOD) scenarios are imperative. Current approaches for OOD classification are untested for complex domains like AD, are limited in the kinds of shifts they detect, or even require supervision with OOD samples. To prepare for unanticipated shifts, we instead establish a framework around a principled, unsupervised, and model-agnostic method that unifies detection of all kinds of shifts: Find a full model of the training data's feature distribution, to then use its density at new points as an in-distribution (ID) score. To implement this, we propose to combine the newly available Vision Foundation Models (VFMs) as feature extractors with one of four alternative density modeling techniques. In an extensive benchmark of 4 VFMs against 20 baselines, we show the superior performance of VFM feature encodings compared to shift-specific OOD monitors. Additionally, we find that sophisticated architectures outperform larger latent space dimensionality; and our method identifies samples with higher risk of errors on downstream tasks, despite being model-agnostic. This suggests that VFMs are promising to realize model-agnostic, unsupervised, reliable safety monitors in complex vision tasks.
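The core scoring idea above (fit a density model on the training data's feature distribution, then use the density at a new point as an ID score) can be sketched with one of the four mentioned techniques, a Gaussian KDE. This is a minimal illustration, not the paper's implementation: the function name, bandwidth, and the synthetic stand-ins for VFM feature vectors are all assumptions.

```python
import numpy as np

def kde_log_density(train_feats, query_feats, bandwidth=1.0):
    """Log-density of a Gaussian KDE fit on train_feats, evaluated at query_feats.

    train_feats: (n_train, d) array of ID feature vectors (e.g., from a VFM).
    query_feats: (n_query, d) array of features to score.
    """
    n, d = train_feats.shape
    # Per-kernel log Gaussian exponents, shape (n_query, n_train)
    diffs = (query_feats[:, None, :] - train_feats[None, :, :]) / bandwidth
    log_kernels = -0.5 * np.sum(diffs ** 2, axis=-1)
    # Log-sum-exp over training points for numerical stability
    m = log_kernels.max(axis=1, keepdims=True)
    lse = m[:, 0] + np.log(np.exp(log_kernels - m).sum(axis=1))
    # Normalization for a mixture of n isotropic Gaussians with std = bandwidth
    log_norm = -np.log(n) - 0.5 * d * np.log(2 * np.pi) - d * np.log(bandwidth)
    return lse + log_norm

rng = np.random.default_rng(0)
# Synthetic stand-ins for VFM features: ID features cluster near the origin,
# OOD features are shifted far away (purely illustrative).
id_train = rng.normal(0.0, 1.0, size=(500, 8))
id_test = rng.normal(0.0, 1.0, size=(10, 8))
ood_test = rng.normal(6.0, 1.0, size=(10, 8))

id_scores = kde_log_density(id_train, id_test)
ood_scores = kde_log_density(id_train, ood_test)
print(id_scores.mean() > ood_scores.mean())  # ID samples receive higher density scores
```

In the framework described above, `train_feats` would hold features extracted from the ID training set by a frozen VFM, and a threshold on the log-density score would flag OOD inputs at operation time; KDE could be swapped for a GMM, normalizing flow, or VAE-based density model.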