🤖 AI Summary
This work addresses the absence of a unified benchmark for evaluating the counting capabilities of multimodal large language models (MLLMs) across the image, text, and audio modalities. To this end, we introduce UNICBench, the first comprehensive evaluation benchmark and toolkit supporting unified counting assessment across images, documents, and audio. Our contributions include a three-level capability taxonomy with difficulty labels, modality-specific matching rules, a standardized evaluation protocol, and precisely annotated data with deterministic numerical parsing. Evaluations of 45 state-of-the-art MLLMs reveal that while models perform reasonably well on basic counting tasks, they fall short in reasoning-intensive and high-difficulty scenarios, exposing long-tail failure patterns and highlighting key directions for improvement. The benchmark and toolkit are publicly released to support reproducible and reliable future research.
📝 Abstract
Counting is a core capability for multimodal large language models (MLLMs), yet no unified counting benchmark exists to rigorously evaluate this ability across image, text, and audio. We present UNICBench, a unified multimodal, multi-level counting benchmark and evaluation toolkit with accurate ground truth, deterministic numeric parsing, and stratified reporting. The corpus comprises 5,300 images (5,508 QA pairs), 872 documents (5,888 QA pairs), and 2,069 audio clips (2,905 QA pairs), annotated with a three-level capability taxonomy and difficulty tags. Under a standardized protocol with fixed splits, prompts, and seeds, together with modality-specific matching rules, we evaluate 45 state-of-the-art MLLMs across modalities. Results show strong performance on some basic counting tasks but significant gaps on reasoning-intensive tasks and the hardest partitions, highlighting long-tail errors and substantial headroom for improving general counting. UNICBench offers a rigorous, comparable basis for measurement and a public toolkit to accelerate progress.
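To make "deterministic numeric parsing" concrete, here is a minimal sketch of how such a parser and exact-match scorer could behave; the function names, regexes, and tie-breaking rules below are illustrative assumptions, not the released toolkit's API.

```python
# Illustrative sketch only: all names here are hypothetical, not UNICBench's API.
import re

_WORDS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
    "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}

def parse_count(answer: str):
    """Deterministically extract a single integer count from a model answer.

    Returns None when no unambiguous number is found, so ambiguous outputs
    are scored as incorrect rather than guessed at.
    """
    text = answer.strip().lower()
    digits = re.findall(r"-?\d+", text.replace(",", ""))
    if len(digits) == 1:
        return int(digits[0])
    if not digits:
        # Fall back to spelled-out numbers, again requiring exactly one match.
        words = [w for w in re.findall(r"[a-z]+", text) if w in _WORDS]
        if len(words) == 1:
            return _WORDS[words[0]]
    return None  # zero or multiple candidates: ambiguous, no parse

def score(prediction: str, ground_truth: int) -> bool:
    """Exact match on the parsed integer."""
    return parse_count(prediction) == ground_truth

if __name__ == "__main__":
    assert score("There are 7 apples.", 7)
    assert score("Seven", 7)
    assert not score("maybe 3 or 4", 3)  # ambiguous -> unparsed -> incorrect
```

The design point such a parser illustrates is that scoring depends only on string rules, not on a judge model, which is what makes results reproducible across runs and comparable across the 45 evaluated MLLMs.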