CountGD: Multi-Modal Open-World Counting

📅 2024-07-05
🏛️ Neural Information Processing Systems
📈 Citations: 3
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses open-vocabulary object counting by proposing the first multi-modal, general-purpose counting framework that accepts textual descriptions, visual exemplars, or both as prompts. To overcome the limitations of closed-vocabulary and unimodal approaches, the method builds on GroundingDINO for detection grounding, adds a visual exemplar encoding module, and introduces a cross-modal fusion attention mechanism that models both reinforcing and constraining interactions between textual and visual prompts while remaining end-to-end differentiable. Evaluated on multiple open-world counting benchmarks, it consistently surpasses the state of the art: in text-only mode it matches or exceeds existing text-only approaches, and in joint text-and-exemplar mode it achieves significant further gains. The code and an interactive demo are publicly released.
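The core idea, fusing text tokens and visual exemplar tokens into one prompt sequence that image features attend to, can be sketched as below. This is a minimal illustrative sketch, not CountGD's actual implementation: the function names, shapes, and single-head dot-product attention are assumptions for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_prompts(text_tokens, exemplar_tokens, image_tokens):
    """Hypothetical sketch of multi-modal prompt fusion.

    text_tokens:     (n_text, d) embeddings of the text description
    exemplar_tokens: (n_ex, d)   embeddings of visual exemplar crops
    image_tokens:    (n_img, d)  image feature tokens

    Returns prompt-conditioned image features of shape (n_img, d).
    """
    # Concatenate both modalities into one prompt sequence, so text
    # and exemplars can jointly (reinforcing or constraining each
    # other) condition the image features.
    prompts = np.concatenate([text_tokens, exemplar_tokens], axis=0)
    d = prompts.shape[1]
    # Image tokens cross-attend to the fused prompt sequence.
    attn = softmax(image_tokens @ prompts.T / np.sqrt(d))
    return attn @ prompts
```

In the real model this fusion is learned end-to-end inside the detector, and the conditioned features feed a detection head whose box predictions are counted; the sketch only shows the attention-based mixing step.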

๐Ÿ“ Abstract
The goal of this paper is to improve the generality and accuracy of open-vocabulary object counting in images. To improve the generality, we repurpose an open-vocabulary detection foundation model (GroundingDINO) for the counting task, and also extend its capabilities by introducing modules to enable specifying the target object to count by visual exemplars. In turn, these new capabilities - being able to specify the target object by multi-modalities (text and exemplars) - lead to an improvement in counting accuracy. We make three contributions: First, we introduce the first open-world counting model, CountGD, where the prompt can be specified by a text description or visual exemplars or both; Second, we show that the performance of the model significantly improves the state of the art on multiple counting benchmarks - when using text only, CountGD is comparable to or outperforms all previous text-only works, and when using both text and visual exemplars, we outperform all previous models; Third, we carry out a preliminary study into different interactions between the text and visual exemplar prompts, including the cases where they reinforce each other and where one restricts the other. The code and an app to test the model are available at https://www.robots.ox.ac.uk/~vgg/research/countgd/.
Problem

Research questions and friction points this paper is trying to address.

Improve open-vocabulary object counting generality and accuracy.
Enable target object specification via text and visual exemplars.
Enhance counting performance using multi-modal prompts.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Repurposed GroundingDINO for counting tasks
Introduced multi-modal target specification
Enhanced accuracy with text and visual prompts