OODBench: Out-of-Distribution Benchmark for Large Vision-Language Models

📅 2026-02-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language models lack systematic evaluation of robustness under out-of-distribution (OOD) data, which hinders their safe deployment in high-stakes domains such as autonomous driving and healthcare. To address this gap, this work proposes OODBench, a large-scale evaluation benchmark comprising 40,000 instance-level OOD samples. It combines an automated data-curation and annotation pipeline with a difficulty-progressive prompting strategy that advances from easy to hard questions, yielding a fine-grained, scalable, and automated assessment framework. Experiments show that even state-of-the-art models suffer significant performance degradation on OOD samples drawn from common categories, exposing critical limitations in generalization and safety. OODBench thus provides a reliable foundation for developing and evaluating more robust vision-language models.

📝 Abstract
Existing Vision-Language Models (VLMs) have achieved significant progress by training on massive-scale datasets, typically under the assumption that data are independent and identically distributed (IID). In real-world scenarios, however, it is often impractical to expect that all data processed by an AI system satisfy this assumption. Moreover, failing to handle out-of-distribution (OOD) objects appropriately can introduce safety risks in real-world applications (e.g., autonomous driving or medical assistance). Unfortunately, current research has not yet provided valid benchmarks that comprehensively assess how VLMs respond to OOD data. We therefore propose OODBench, a predominantly automated method with minimal human verification for constructing new benchmarks and evaluating the ability of VLMs to process OOD data. OODBench contains 40K instance-level OOD instance-category pairs, and we show that current VLMs still exhibit notable performance degradation on OODBench even when the underlying image categories are common. In addition, we propose a reliable automated assessment metric that employs a Basic-to-Advanced Progression of prompted questions to more fully assess the impact of OOD data across question difficulties. Lastly, we summarize substantial findings and insights to facilitate future research on the acquisition and evaluation of OOD data.
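The Basic-to-Advanced Progression described in the abstract can be sketched as a tiered evaluation loop. Everything below — the tier names, the exact-match scoring rule, and the `model(image, question)` interface — is an illustrative assumption, not the paper's actual implementation:

```python
# Hypothetical sketch of a difficulty-progressive (Basic-to-Advanced)
# evaluation loop. Tier names, scoring, and the model interface are
# assumptions for illustration only.

TIERS = ["basic", "intermediate", "advanced"]

def evaluate_progression(model, samples):
    """Score a VLM on OOD samples tier by tier, from easy to hard.

    `model`  : any callable model(image, question) -> answer string.
    `samples`: list of dicts with keys: image, tier, question, answer.
    Returns per-tier accuracy, which shows at which difficulty level
    degradation on OOD data sets in.
    """
    results = {tier: {"correct": 0, "total": 0} for tier in TIERS}
    for s in samples:
        pred = model(s["image"], s["question"])
        results[s["tier"]]["total"] += 1
        # Exact-match scoring after normalization; a real metric would
        # likely use a more tolerant answer-matching scheme.
        if pred.strip().lower() == s["answer"].strip().lower():
            results[s["tier"]]["correct"] += 1
    return {t: r["correct"] / max(r["total"], 1) for t, r in results.items()}
```

Reporting accuracy per tier rather than a single aggregate number is what makes the impact of OOD data on questions of varying difficulty visible.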
Problem

Research questions and friction points this paper is trying to address.

out-of-distribution
vision-language models
benchmark
distribution shift
model robustness
Innovation

Methods, ideas, or system contributions that make the work stand out.

out-of-distribution
vision-language models
automated benchmark
OOD evaluation
prompt progression
Ling Lin
Intuitive Surgical Inc; AcuFocus Inc; UC Irvine
Robotics; Clinical Research; Vision Science; Ophthalmology; Cognitive Sciences
Yang Bai
A*STAR, Singapore
machine learning
Heng Su
Tsinghua University
super-resolution; computer vision; image processing
Congcong Zhu
USTC
Multimedia Understanding
Yaoxing Wang
Unmanned System Research Institute, Northwestern Polytechnical University, Xi'an, China
Yang Zhou
IHPC, A*STAR, Singapore
Huazhu Fu
Principal Scientist, IHPC, A*STAR
Medical Image Analysis; AI for Healthcare; Medical AI; Trustworthy AI
Jingrun Chen
University of Science and Technology of China, Hefei, China, Suzhou Institute for Advanced Research, USTC, Suzhou, China, Key Laboratory of the Ministry of Education for Mathematical Foundations and Applications of Digital Technology, Suzhou, China