🤖 AI Summary
Current vision-language models lack systematic evaluation of robustness under out-of-distribution (OOD) data, hindering their safe deployment in high-stakes domains such as autonomous driving and healthcare. To address this gap, this work proposes OODBench, the first large-scale evaluation benchmark of its kind, comprising 40,000 instance-level OOD samples. It leverages an automated data curation and annotation pipeline, combined with a prompting strategy that progresses from easy to hard questions, to establish a fine-grained and scalable automated assessment framework. Experimental results reveal that even state-of-the-art models suffer significant performance degradation on OOD samples, even when the underlying image categories are common, exposing critical limitations in generalization and safety. OODBench thus provides a reliable foundation for the development and evaluation of more robust vision-language models.
📝 Abstract
Existing Vision-Language Models (VLMs) have achieved significant progress by being trained on massive-scale datasets, typically under the assumption that data are independent and identically distributed (IID). In real-world scenarios, however, it is often impractical to expect that all data processed by an AI system satisfy this assumption. Moreover, failure to appropriately handle out-of-distribution (OOD) objects may introduce safety risks in real-world applications (e.g., autonomous driving or medical assistance). Unfortunately, current research has not yet provided valid benchmarks that can comprehensively assess the performance of VLMs on OOD data. We therefore propose OODBench, a predominantly automated method with minimal human verification for constructing new benchmarks and evaluating the ability of VLMs to process OOD data. OODBench contains 40K instance-level OOD instance–category pairs, and we show that current VLMs still exhibit notable performance degradation on OODBench, even when the underlying image categories are common. In addition, we propose a reliable automated assessment metric that employs a Basic-to-Advanced Progression of prompted questions to more fully assess the impact of OOD data across question difficulties. Lastly, we summarize substantial findings and insights to facilitate future research in the acquisition and evaluation of OOD data.