T2ICount: Enhancing Cross-modal Understanding for Zero-Shot Counting

📅 2025-02-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses zero-shot text-guided object counting: counting instances of unseen categories using only arbitrary textual descriptions. Existing CLIP-based methods exhibit insufficient sensitivity to textual prompts, which limits counting accuracy. To address this, the authors propose T2ICount, a framework built on a pre-trained diffusion model that features: (i) a Hierarchical Semantic Correction Module that progressively refines text-image feature alignment; (ii) a Representational Regional Coherence Loss that derives reliable supervision from the cross-attention maps of the denoising U-Net; and (iii) a re-annotated FSC147 subset focused on minority classes, for evaluating text sensitivity under challenging conditions. The method achieves significant improvements over state-of-the-art approaches across multiple benchmarks, particularly in responsiveness to fine-grained textual prompts. The source code is publicly available.

📝 Abstract
Zero-shot object counting aims to count instances of arbitrary object categories specified by text descriptions. Existing methods typically rely on vision-language models like CLIP, but often exhibit limited sensitivity to text prompts. We present T2ICount, a diffusion-based framework that leverages rich prior knowledge and fine-grained visual understanding from pretrained diffusion models. While one-step denoising ensures efficiency, it leads to weakened text sensitivity. To address this challenge, we propose a Hierarchical Semantic Correction Module that progressively refines text-image feature alignment, and a Representational Regional Coherence Loss that provides reliable supervision signals by leveraging the cross-attention maps extracted from the denoising U-Net. Furthermore, we observe that current benchmarks mainly focus on majority objects in images, potentially masking models' text sensitivity. To address this, we contribute a challenging re-annotated subset of FSC147 for better evaluation of text-guided counting ability. Extensive experiments demonstrate that our method achieves superior performance across different benchmarks. Code is available at https://github.com/cha15yq/T2ICount.
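To make the supervision signal described in the abstract more concrete, the following is a toy sketch (not the authors' implementation; all names, shapes, and dimensions are invented for illustration) of how a cross-attention map between image-patch queries and text-token keys yields a per-token spatial response map — the kind of signal the Representational Regional Coherence Loss draws on from the denoising U-Net:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_map(img_feats, txt_feats):
    """Toy cross-attention: image patches (queries) attend to text tokens (keys).

    img_feats: (P, d) patch features; txt_feats: (T, d) token features.
    Returns a (P, T) map; column t is the spatial response of token t
    over the image patches.
    """
    d = img_feats.shape[-1]
    scores = img_feats @ txt_feats.T / np.sqrt(d)  # scaled dot-product
    return softmax(scores, axis=-1)                # normalize over tokens

# Hypothetical example: 16 patches (a 4x4 grid), 3 text tokens, feature dim 8
rng = np.random.default_rng(0)
patches = rng.normal(size=(16, 8))
tokens = rng.normal(size=(3, 8))

attn = cross_attention_map(patches, tokens)        # (16, 3)
# Pool the tokens of the category phrase into one spatial response map
response = attn.mean(axis=1).reshape(4, 4)
```

In the paper's setting, maps like `response` come from the U-Net's cross-attention layers rather than from random features; the point here is only the shape of the computation, where regions that attend strongly to the category text light up and can supervise region-level coherence.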
Problem

Research questions and friction points this paper is trying to address.

Counting objects of arbitrary, unseen categories specified only by text descriptions
Limited sensitivity of existing CLIP-based vision-language methods to text prompts
Current benchmarks focus on majority objects in images, masking models' text sensitivity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion-based framework (T2ICount) leveraging pre-trained diffusion priors for zero-shot counting
Hierarchical Semantic Correction Module progressively refines text-image feature alignment
Representational Regional Coherence Loss supervises via cross-attention maps from the denoising U-Net
Re-annotated FSC147 subset for evaluating text-guided counting ability