🤖 AI Summary
To address the out-of-distribution (OOD) detection challenge in autonomous driving vision systems, exacerbated by semantic and covariate shifts in open-world environments, this paper proposes the first unsupervised, model-agnostic input monitoring framework tailored for autonomous driving. Leveraging robust representations from vision foundation models (e.g., ViT-MAE, DINOv2), the method unifies four unsupervised density modeling techniques (KDE, GMM, normalizing flows, and VAE-based methods) to estimate feature-space densities without requiring OOD samples or downstream task fine-tuning. Key contributions include: (i) the first systematic evaluation of vision foundation models' generalizability for OOD detection in autonomous driving; (ii) an empirical finding that model capacity, not latent dimensionality, is the dominant factor governing detection performance; and (iii) consistent superiority over 20 state-of-the-art baselines, with average AUC improvements of 12.3%, and reliable identification of high-risk misclassifications, demonstrating its viability as a safety-critical monitoring module.
📄 Abstract
Deep neural networks (DNNs) remain challenged by distribution shifts in complex open-world domains like automated driving (AD): Absolute robustness against yet unknown novel objects (semantic shift) or styles like lighting conditions (covariate shift) cannot be guaranteed. Hence, reliable operation-time monitors for identification of out-of-training-data-distribution (OOD) scenarios are imperative. Current approaches for OOD classification are untested for complex domains like AD, are limited in the kinds of shifts they detect, or even require supervision with OOD samples. To prepare for unanticipated shifts, we instead establish a framework around a principled, unsupervised, and model-agnostic method that unifies detection of all kinds of shifts: Find a full model of the training data's feature distribution, to then use its density at new points as an in-distribution (ID) score. To implement this, we propose to combine the newly available Vision Foundation Models (VFMs) as feature extractors with one of four alternative density modeling techniques. In an extensive benchmark of 4 VFMs against 20 baselines, we show the superior performance of VFM feature encodings compared to shift-specific OOD monitors. Additionally, we find that sophisticated architectures outperform larger latent space dimensionality; and our method identifies samples with higher risk of errors on downstream tasks, despite being model-agnostic. This suggests that VFMs are promising to realize model-agnostic, unsupervised, reliable safety monitors in complex vision tasks.
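The core scoring idea above (fit a density model on the training data's feature distribution, then use the density at a new point as an ID score) can be sketched with one of the four mentioned techniques, a Gaussian KDE. This is a minimal illustration, not the paper's implementation: the function name, bandwidth, and the synthetic stand-ins for VFM feature vectors are all assumptions.

```python
import numpy as np

def kde_log_density(train_feats, query_feats, bandwidth=1.0):
    """Log-density of a Gaussian KDE fit on train_feats, evaluated at query_feats.

    train_feats: (n_train, d) array of ID feature vectors (e.g., from a VFM).
    query_feats: (n_query, d) array of features to score.
    """
    n, d = train_feats.shape
    # Per-kernel log Gaussian exponents, shape (n_query, n_train)
    diffs = (query_feats[:, None, :] - train_feats[None, :, :]) / bandwidth
    log_kernels = -0.5 * np.sum(diffs ** 2, axis=-1)
    # Log-sum-exp over training points for numerical stability
    m = log_kernels.max(axis=1, keepdims=True)
    lse = m[:, 0] + np.log(np.exp(log_kernels - m).sum(axis=1))
    # Normalization for a mixture of n isotropic Gaussians with std = bandwidth
    log_norm = -np.log(n) - 0.5 * d * np.log(2 * np.pi) - d * np.log(bandwidth)
    return lse + log_norm

rng = np.random.default_rng(0)
# Synthetic stand-ins for VFM features: ID features cluster near the origin,
# OOD features are shifted far away (purely illustrative).
id_train = rng.normal(0.0, 1.0, size=(500, 8))
id_test = rng.normal(0.0, 1.0, size=(10, 8))
ood_test = rng.normal(6.0, 1.0, size=(10, 8))

id_scores = kde_log_density(id_train, id_test)
ood_scores = kde_log_density(id_train, ood_test)
print(id_scores.mean() > ood_scores.mean())  # ID samples receive higher density scores
```

In the framework described above, `train_feats` would hold features extracted from the ID training set by a frozen VFM, and a threshold on the log-density score would flag OOD inputs at operation time; KDE could be swapped for a GMM, normalizing flow, or VAE-based density model.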