AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models

📅 2025-06-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current VFM evaluation faces two key bottlenecks: (1) instruction-tuning data mismatches the VQA test distribution, and (2) VQA tasks conflate multiple visual capabilities, hindering root-cause error attribution. To address these, we propose the Atomic Visual Ability (AVA) disentanglement framework and introduce AVA-Bench—the first fine-grained benchmark covering 14 decoupled atomic capabilities (e.g., localization, depth estimation, spatial reasoning). Our framework establishes a single-capability isolation evaluation paradigm with aligned training–test distributions. Leveraging multi-scale annotation and a lightweight LLM (0.5B)-assisted evaluation protocol, we precisely characterize VFMs’ “capability fingerprints.” Experiments show that the 0.5B LLM achieves near-identical evaluation outcomes compared to a 7B model (Spearman ρ > 0.98), while reducing GPU inference time by 8×. This work provides a reproducible, attributable, and interpretable infrastructure for diagnostic VFM assessment.
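The evaluator-substitution claim rests on rank agreement: if the 0.5B and 7B judge LLMs order the VFMs the same way, Spearman ρ approaches 1. A minimal sketch of that check, using the classic no-ties formula ρ = 1 − 6·Σd²/(n(n²−1)); the per-model scores below are illustrative placeholders, not numbers from the paper:

```python
def ranks(scores):
    # Rank positions (1 = highest score); assumes no tied scores for simplicity.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    r = [0] * len(scores)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman_rho(a, b):
    # Spearman rank correlation: rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)),
    # valid when neither list contains ties.
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical per-VFM accuracy on one atomic ability, as scored by the
# 0.5B and 7B evaluator LLMs (made-up values for illustration only).
scores_05b = [71.2, 64.8, 58.3, 80.1, 69.5]
scores_7b  = [70.4, 66.0, 57.1, 81.3, 68.2]
print(round(spearman_rho(scores_05b, scores_7b), 3))  # identical orderings -> 1.0
```

With the illustrative scores the two judges rank the five VFMs identically, so ρ = 1.0; the paper's reported ρ > 0.98 reflects the same kind of agreement across its real evaluation runs.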

📝 Abstract
The rise of vision foundation models (VFMs) calls for systematic evaluation. A common approach pairs VFMs with large language models (LLMs) as general-purpose heads, followed by evaluation on broad Visual Question Answering (VQA) benchmarks. However, this protocol has two key blind spots: (i) the instruction tuning data may not align with VQA test distributions, meaning a wrong prediction can stem from such data mismatch rather than a VFM's visual shortcomings; (ii) VQA benchmarks often require multiple visual abilities, making it hard to tell whether errors stem from lacking all required abilities or just a single critical one. To address these gaps, we introduce AVA-Bench, the first benchmark that explicitly disentangles 14 Atomic Visual Abilities (AVAs) -- foundational skills like localization, depth estimation, and spatial understanding that collectively support complex visual reasoning tasks. By decoupling AVAs and matching training and test distributions within each, AVA-Bench pinpoints exactly where a VFM excels or falters. Applying AVA-Bench to leading VFMs thus reveals distinctive "ability fingerprints," turning VFM selection from educated guesswork into principled engineering. Notably, we find that a 0.5B LLM yields similar VFM rankings as a 7B LLM while cutting GPU hours by 8x, enabling more efficient evaluation. By offering a comprehensive and transparent benchmark, we hope AVA-Bench lays the foundation for the next generation of VFMs.
Problem

Research questions and friction points this paper is trying to address.

Evaluates vision foundation models on specific, isolated visual abilities
Addresses the mismatch between instruction-tuning data and VQA test distributions
Attributes errors on complex tasks to individual visual skill deficiencies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Disentangles 14 Atomic Visual Abilities
Matches training and test distributions
Enables efficient evaluation with smaller LLMs