🤖 AI Summary
Large Vision-Language Models (LVLMs) count numerous objects in complex images inaccurately and generalize poorly to out-of-distribution (OOD) scenarios with large object counts. To address this, we propose a zero-shot divide-and-conquer counting framework that performs structured visual reasoning through multi-scale region decomposition and aggregation, augmented by an object-deduplication mechanism that localizes and filters repeated detections. Crucially, our method requires no fine-tuning or additional training, and it is the first to achieve cross-dataset zero-shot generalization for visual counting. On multiple counting benchmarks it significantly outperforms state-of-the-art approaches, achieving up to a 23.6% absolute accuracy gain in OOD settings, with markedly improved robustness and generalization. This work establishes a novel paradigm for fine-grained visual understanding in LVLMs.
📝 Abstract
Counting is a fundamental operation for many visual tasks in real-life applications, requiring both object recognition and robust counting capabilities. Despite their advanced visual perception, large vision-language models (LVLMs) struggle with counting tasks, especially when the number of objects exceeds what is commonly encountered during training. We enhance LVLMs' counting abilities with a divide-and-conquer approach that breaks a counting problem into sub-counting tasks. Our method includes a mechanism that prevents objects from being bisected by region boundaries and therefore counted repeatedly, as happens in a naive divide-and-conquer approach. Unlike prior methods, which generalize poorly to counting datasets they were not trained on, ours performs well on new datasets without any additional training or fine-tuning. We demonstrate that our approach improves the counting capability of LVLMs across various datasets and benchmarks.
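The divide-and-conquer idea and the double-counting problem it must solve can be illustrated with a toy sketch. This is not the paper's implementation: the uniform tiling, the stand-in `detect` function (a real system would query an LVLM per region), and all names below are assumptions for illustration. It shows why naively summing per-tile counts overcounts objects that straddle tile boundaries, and how deduplicating detections by object identity fixes that.

```python
# Illustrative sketch only (hypothetical names; not the paper's method):
# divide an image into tiles, "count" per tile, and deduplicate objects
# that straddle tile boundaries so they are counted once.
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def tiles(w: float, h: float, splits: int) -> List[Box]:
    """Split a w x h image into a splits x splits grid of tiles."""
    tw, th = w / splits, h / splits
    return [(i * tw, j * th, (i + 1) * tw, (j + 1) * th)
            for i in range(splits) for j in range(splits)]


def detect(objects: List[Box], tile: Box) -> List[Box]:
    """Stand-in for a per-tile LVLM query: objects overlapping the tile."""
    x1, y1, x2, y2 = tile
    return [b for b in objects
            if b[0] < x2 and b[2] > x1 and b[1] < y2 and b[3] > y1]


def count_divide_and_conquer(objects: List[Box], w: float, h: float,
                             splits: int = 2) -> int:
    """Count per tile, deduplicating detections shared across tiles."""
    seen = set()
    for tile in tiles(w, h, splits):
        for box in detect(objects, tile):
            seen.add(box)  # an object seen in several tiles counts once
    return len(seen)


if __name__ == "__main__":
    # Three objects in a 100 x 100 image; the middle one sits on the
    # center cut of a 2 x 2 grid and overlaps all four tiles.
    objs = [(10, 10, 20, 20), (45, 45, 55, 55), (80, 80, 90, 90)]
    naive = sum(len(detect(objs, t)) for t in tiles(100, 100, 2))
    print(naive)                                       # 6 (overcounts)
    print(count_divide_and_conquer(objs, 100, 100, 2))  # 3 (correct)
```

In practice, deduplication cannot rely on identical boxes across tiles; the paper's mechanism instead prevents objects from being bisected in the first place, which this toy identity-based merge only approximates.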