LVLM-COUNT: Enhancing the Counting Ability of Large Vision-Language Models

📅 2024-12-01
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large Vision-Language Models (LVLMs) exhibit weak accuracy in counting numerous objects within complex images, particularly suffering from poor generalization to out-of-distribution (OOD) scenarios involving large object counts. To address this, we propose a zero-shot divide-and-conquer counting framework that enables structured visual reasoning via multi-scale region decomposition and aggregation, augmented by a repetition-avoidance object deduplication mechanism that precisely localizes and filters duplicate detections. Crucially, our method requires no fine-tuning or additional training. It is the first to achieve cross-dataset zero-shot generalization for visual counting. Evaluated on multiple counting benchmarks, it significantly outperforms state-of-the-art approaches—achieving up to a 23.6% absolute accuracy gain in OOD settings—while demonstrating markedly improved robustness and generalization. This work establishes a novel paradigm for fine-grained visual understanding in LVLMs.

📝 Abstract
Counting is a fundamental operation for many visual tasks in real-life applications, requiring both object recognition and robust counting capabilities. Despite their advanced visual perception, large vision-language models (LVLMs) struggle with counting tasks, especially when the number of objects exceeds those commonly encountered during training. We enhance LVLMs' counting abilities with a divide-and-conquer approach that breaks a counting problem into sub-counting tasks. Our method includes a mechanism that prevents objects from being bisected across sub-regions and thus counted repeatedly, as happens in a naive divide-and-conquer approach. Unlike prior methods, which generalize poorly to counting datasets they have not been trained on, our method performs well on new datasets without any additional training or fine-tuning. We demonstrate that our approach enhances the counting capability of LVLMs across various datasets and benchmarks.
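The core idea above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes objects are given as bounding boxes and that each sub-region "detects" every box intersecting it (in the real method, an LVLM counts per region). A naive divide-and-conquer sum double-counts boxes that straddle a region border; deduplication counts each detection once.

```python
# Hypothetical sketch of divide-and-conquer counting with deduplication.
# Objects are bounding boxes; per-region detection is simulated by
# intersection tests instead of an LVLM query.

from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)

def split_regions(w: float, h: float) -> List[Box]:
    """Bisect the image into four quadrants (one decomposition level)."""
    mx, my = w / 2, h / 2
    return [(0, 0, mx, my), (mx, 0, w, my), (0, my, mx, h), (mx, my, w, h)]

def intersects(a: Box, b: Box) -> bool:
    """True when two axis-aligned boxes overlap."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def naive_count(objects: List[Box], regions: List[Box]) -> int:
    """Sum per-region counts; boxes straddling a border are counted twice."""
    return sum(sum(1 for o in objects if intersects(o, r)) for r in regions)

def dedup_count(objects: List[Box], regions: List[Box]) -> int:
    """Count each detected object once, keyed by its location."""
    seen = set()
    for r in regions:
        for o in objects:
            if intersects(o, r):
                seen.add(o)  # in practice: merge detections with high IoU
    return len(seen)
```

For a 10x10 image split at x = 5, a box spanning x = 4..6 is detected in two quadrants, so the naive sum over-counts it while the deduplicated count does not.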
Problem

Research questions and friction points this paper is trying to address.

Large-scale Visual Language Models
Object Counting
Performance Degradation
Innovation

Methods, ideas, or system contributions that make the work stand out.

LVLM-COUNT
sub-task decomposition
counting accuracy enhancement
Muhammad Fetrat Qharabagh
PhD Student, University of Waterloo
Neural Networks · Transformers · Algorithmic Reasoning with Neural Networks
Mohammadreza Ghofrani
Independent Researcher
K. Fountoulakis
University of Waterloo