DatBench: Discriminative, Faithful, and Efficient VLM Evaluations

📅 2026-01-05
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Current evaluation methods for vision-language models (VLMs) commonly suffer from modality unfaithfulness, insufficient discriminative power, and computational inefficiency. This work is the first to systematically articulate three core desiderata for VLM evaluation: faithfulness, discriminability, and efficiency. We establish a high-quality evaluation pipeline by reformulating multiple-choice tasks as generative ones, filtering out samples amenable to blind guessing (up to 70% of instances in some datasets), and correcting mislabeled examples (up to 42% of cases in certain datasets). Based on this framework, we introduce DatBench-Full, encompassing 33 datasets, along with its highly discriminative subset, DatBench. Our benchmarks retain discriminative capacity comparable to the original benchmarks while achieving an average 13× (and up to 50×) acceleration in evaluation speed.

📝 Abstract
Empirical evaluation serves as the primary compass guiding research progress in foundation models. Despite a large body of work focused on training frontier vision-language models (VLMs), approaches to their evaluation remain nascent. To guide their maturation, we propose three desiderata that evaluations should satisfy: (1) faithfulness to the modality and application, (2) discriminability between models of varying quality, and (3) efficiency in compute. Through this lens, we identify critical failure modes that violate faithfulness and discriminability, misrepresenting model capabilities: (i) multiple-choice formats reward guessing, poorly reflect downstream use cases, and saturate early as models improve; (ii) blindly solvable questions, which can be answered without images, constitute up to 70% of some evaluations; and (iii) mislabeled or ambiguous samples compromise up to 42% of examples in certain datasets. Regarding efficiency, the computational burden of evaluating frontier models has become prohibitive: by some accounts, nearly 20% of development compute is devoted to evaluation alone. Rather than discarding existing benchmarks, we curate them via transformation and filtering to maximize fidelity and discriminability. We find that converting multiple-choice questions to generative tasks reveals sharp capability drops of up to 35%. In addition, filtering blindly solvable and mislabeled samples improves discriminative power while simultaneously reducing computational cost. We release DatBench-Full, a cleaned evaluation suite of 33 datasets spanning nine VLM capabilities, and DatBench, a discriminative subset that achieves 13x average speedup (up to 50x) while closely matching the discriminative power of the original datasets. Our work outlines a path toward evaluation practices that are both rigorous and sustainable as VLMs continue to scale.
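The abstract describes two curation steps: recasting multiple-choice questions as generative tasks, and filtering out "blindly solvable" samples that a model can answer without the image. The sketch below illustrates these two steps in minimal form; all names (`Sample`, `to_generative`, `blind_answer`) are illustrative assumptions, not the authors' actual pipeline or API.

```python
# Hypothetical sketch of the two curation steps the abstract describes:
# (1) strip multiple-choice options so models must generate the answer, and
# (2) drop samples a text-only (image-blind) model already answers correctly.
from dataclasses import dataclass, replace


@dataclass
class Sample:
    question: str
    choices: list[str]  # multiple-choice options, if any
    answer: str         # gold answer


def to_generative(sample: Sample) -> Sample:
    # Remove the answer options so the model must produce, not pick, the answer.
    return replace(sample, choices=[])


def filter_blindly_solvable(samples: list[Sample], blind_answer) -> list[Sample]:
    # blind_answer(question) stands in for a model that never sees the image;
    # keep only samples it fails, so the remaining items genuinely need vision.
    return [s for s in samples if blind_answer(s.question) != s.answer]


samples = [
    Sample("What color is the traffic light in the photo?", ["red", "green"], "green"),
    Sample("How many legs does the pictured spider have?", ["6", "8"], "8"),  # solvable blind
]
generative = [to_generative(s) for s in samples]
# Toy stand-in for an image-blind model: it "knows" spiders have 8 legs.
kept = filter_blindly_solvable(generative, lambda q: "8" if "spider" in q else "unsure")
print(len(kept))  # prints 1: the image-independent question is filtered out
```

The second sample is dropped because world knowledge alone answers it, mirroring the paper's finding that such items can make up as much as 70% of some evaluations.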
Problem

Research questions and friction points this paper is trying to address:
vision-language models, evaluation benchmarks, faithfulness, discriminability, computational efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out:
vision-language models, evaluation benchmark, generative evaluation, dataset curation, computational efficiency
👥 Authors

Siddharth Joshi (DatologyAI)
Haoli Yin (DatologyAI)
Rishabh Adiga (DatologyAI)
Ricardo Monti (DatologyAI)
Aldo Carranza (DatologyAI)
Alex Fang (DatologyAI)
Alvin Deng (DatologyAI)
Amro Abbas (DatologyAI)
Brett Larsen (DatologyAI)
Cody Blakeney (DatologyAI)
Darren Teh (DatologyAI)
David Schwab (DatologyAI)
Fan Pan (DatologyAI)
Haakon Mongstad (DatologyAI)
Jack Urbanek (DatologyAI)
Jason Lee (DatologyAI)
Jason Telanoff (DatologyAI)
Josh Wills (DatologyAI)
Kaleigh Mentzer (DatologyAI)
Luke Merrick (DatologyAI)
Parth Doshi (MS in CSE, University of California San Diego)
Paul Burstein (DatologyAI)
Pratyush Maini (Carnegie Mellon University)
Scott Loftin (DatologyAI)
Spandan Das (DatologyAI)
Tony Jiang (DatologyAI)
Vineeth Dorna (DatologyAI)
Zhengping Wang (DatologyAI)
Bogdan Gaza (DatologyAI)
Ari S. Morcos (DatologyAI)
Matthew L. Leavitt (DatologyAI)