🤖 AI Summary
Existing vision-language models (VLMs) achieve strong zero-shot detection performance on common-object benchmarks but degrade sharply on out-of-distribution categories (e.g., medical imaging), novel tasks, and heterogeneous imaging modalities. Method: We introduce Roboflow100-VL, the first benchmark comprising 100 cross-domain, multimodal object detection datasets, designed to evaluate VLMs on long-tailed, domain-specific (e.g., clinical) concepts in zero- and few-shot settings. We propose a few-shot adaptation paradigm that aligns VLMs to new concepts via visual exemplars paired with rich textual annotation instructions, accompanied by a multimodal annotation protocol and a unified evaluation framework compatible with models including GroundingDINO and Qwen2.5-VL. Contribution/Results: Experiments reveal that state-of-the-art VLMs achieve less than 2% AP in zero-shot detection on medical images. All data, annotations, and code are publicly released to advance robust concept alignment of VLMs for specialized applications.
📝 Abstract
Vision-language models (VLMs) trained on internet-scale data achieve remarkable zero-shot detection performance on common objects like car, truck, and pedestrian. However, state-of-the-art models still struggle to generalize to out-of-distribution classes, tasks, and imaging modalities not typically found in their pre-training. Rather than simply re-training VLMs on more visual data, we argue that one should align VLMs to new concepts with annotation instructions containing a few visual examples and rich textual descriptions. To this end, we introduce Roboflow100-VL, a large-scale collection of 100 multi-modal object detection datasets with diverse concepts not commonly found in VLM pre-training. We evaluate state-of-the-art models on our benchmark in zero-shot, few-shot, semi-supervised, and fully-supervised settings, allowing for comparison across data regimes. Notably, we find that VLMs like GroundingDINO and Qwen2.5-VL achieve less than 2% zero-shot accuracy on challenging medical imaging datasets within Roboflow100-VL, demonstrating the need for few-shot concept alignment. Our code and dataset are available at https://github.com/roboflow/rf100-vl/ and https://universe.roboflow.com/rf100-vl/
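The headline result above (under 2% AP on medical imaging datasets) is stated in terms of detection average precision. As a self-contained illustration of what that metric measures, here is a minimal single-class AP computation at an IoU threshold of 0.5. This is an illustrative sketch only, not the benchmark's official evaluator (detection benchmarks of this kind typically report COCO-style mAP averaged over multiple IoU thresholds); the function names and the greedy matching scheme are simplifying assumptions.

```python
# Minimal sketch of single-class AP@0.5 for object detection.
# Illustrative only -- NOT the official Roboflow100-VL evaluation code.

def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def average_precision(preds, gts, iou_thr=0.5):
    """Un-interpolated AP (area under the raw precision-recall curve).

    preds: list of (confidence, box) tuples; gts: list of ground-truth boxes.
    """
    preds = sorted(preds, key=lambda p: -p[0])  # descending confidence
    matched = [False] * len(gts)
    tps = []
    for _, box in preds:
        # Greedily match each prediction to its best unmatched ground truth.
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(gts):
            if not matched[j]:
                v = iou(box, gt)
                if v > best_iou:
                    best_iou, best_j = v, j
        if best_iou >= iou_thr:
            matched[best_j] = True
            tps.append(1)  # true positive
        else:
            tps.append(0)  # false positive (miss or duplicate)
    # Accumulate precision * delta-recall along the ranked predictions.
    ap, tp_cum, prev_recall = 0.0, 0, 0.0
    for i, tp in enumerate(tps):
        tp_cum += tp
        recall = tp_cum / len(gts)
        precision = tp_cum / (i + 1)
        ap += precision * (recall - prev_recall)
        prev_recall = recall
    return ap
```

Under this metric, a "less than 2% AP" result means that almost no confident predictions overlap ground-truth boxes at the IoU threshold, which is why the abstract argues for few-shot concept alignment rather than zero-shot transfer on such domains.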