ProgRoCC: A Progressive Approach to Rough Crowd Counting

📅 2025-04-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: In large-scale crowd scenes, per-instance annotation is prohibitively expensive, and enumeration-based counting methods scale poorly. Method: This paper proposes a “rough crowd counting” paradigm that requires only image-level rough count-interval labels (e.g., “50–100 people”), eliminating costly fine-grained annotations. It introduces a progressive coarse-to-fine estimation framework and a CLIP-based vision-language matching adapter, which refines visual features by optimizing cross-modal key-value pairs. Contribution/Results: Evaluated on three benchmark datasets (ShanghaiTech, UCF-QNRF, and JHU-CROWD), the method outperforms state-of-the-art semi-supervised and weakly supervised approaches while keeping inference fast. The result is a scalable, low-cost approach to real-world crowd counting.
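To make the idea of an interval label concrete, here is a minimal sketch of how an exact count could be turned into a rough label such as "50-100 people". The bin edges and the function name `rough_label` are assumptions chosen for illustration; the intervals actually used in the paper are not stated here.

```python
from bisect import bisect_right

# Hypothetical bin edges, chosen only for illustration.
BIN_EDGES = [0, 50, 100, 500, 1000]

def rough_label(count, edges=BIN_EDGES):
    """Map an exact count to a coarse interval label.

    Intervals are half-open: a count of 100 falls in "100-500", not "50-100".
    """
    if count < edges[0]:
        raise ValueError("count is below the first bin edge")
    # Index of the last edge that is <= count.
    i = bisect_right(edges, count) - 1
    if i == len(edges) - 1:
        # Anything at or above the last edge goes into an open-ended bin.
        return f"{edges[-1]}+ people"
    return f"{edges[i]}-{edges[i + 1]} people"

label = rough_label(73)
```

An annotator working at this level only has to choose a bin per image, which is what makes the labels cheaper than per-person point annotations.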

📝 Abstract

As the number of individuals in a crowd grows, enumeration-based techniques become increasingly infeasible and their estimates increasingly unreliable. We propose instead an estimation-based version of the problem, which we label Rough Crowd Counting, that delivers better accuracy on the basis of training data that is easier to acquire. Rough crowd counting requires only rough annotations of the number of targets in an image, instead of the more traditional, and far more expensive, per-target annotations. We propose an approach to the rough crowd counting problem based on CLIP, termed ProgRoCC. Specifically, we introduce a progressive estimation learning strategy that determines the object count through a coarse-to-fine approach. This approach delivers answers quickly and outperforms the state of the art in semi- and weakly-supervised crowd counting. In addition, we design a vision-language matching adapter that optimizes key-value pairs by mining effective matches between the two modalities to refine the visual features, thereby improving the final performance. Extensive experimental results on three widely adopted crowd counting datasets demonstrate the effectiveness of our method.
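The coarse-to-fine idea can be illustrated with a small sketch: start from a wide count range, split it into sub-intervals, keep the best-scoring one, and repeat until the range is narrow. This is not the paper's training strategy. The names `progressive_estimate` and `make_toy_scorer`, the number of splits, and the stopping width are all assumptions, and the scorer below cheats by knowing the true count, standing in for a learned image-to-interval matcher.

```python
def progressive_estimate(score, lo, hi, splits=4, min_width=1.0):
    """Narrow [lo, hi) by repeatedly keeping the best-scoring sub-interval."""
    while hi - lo > min_width:
        step = (hi - lo) / splits
        candidates = [(lo + i * step, lo + (i + 1) * step) for i in range(splits)]
        # Keep the candidate interval that the scorer rates highest.
        lo, hi = max(candidates, key=lambda c: score(c[0], c[1]))
    return lo, hi

def make_toy_scorer(true_count):
    """Hypothetical stand-in for a learned matcher (uses the known count)."""
    def score(lo, hi):
        if lo <= true_count < hi:
            return 1.0
        # Outside the interval: penalise by distance to the nearest edge.
        return -min(abs(true_count - lo), abs(true_count - hi))
    return score

lo, hi = progressive_estimate(make_toy_scorer(237), 0, 1024)
estimate = (lo + hi) / 2  # midpoint of the final narrow interval
```

Each round only has to discriminate among a handful of intervals, so the number of comparisons grows with the logarithm of the initial range rather than with the count itself.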

Problem

Research questions and friction points this paper is trying to address.

Proposes rough crowd counting to replace unreliable enumeration methods

Uses progressive estimation learning for coarse-to-fine count accuracy

Enhances performance via vision-language matching adapter for feature refinement

Innovation

Methods, ideas, or system contributions that make the work stand out.

Progressive coarse-to-fine estimation learning strategy

Vision-language matching adapter for feature refinement

CLIP-based rough crowd counting with minimal annotations
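As a rough illustration of vision-language matching with key-value refinement, the sketch below scores an image feature against text-prompt keys by cosine similarity, converts the scores to weights with a softmax, and blends the weighted values back into the image feature. Everything here is an assumption for illustration: the vectors are toy stand-ins rather than CLIP embeddings, and `match_and_refine`, `alpha`, and the temperature value are hypothetical, not the adapter described in the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def softmax(xs, temperature=0.07):
    """Numerically stable softmax with an assumed temperature."""
    scaled = [x / temperature for x in xs]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def match_and_refine(image_feat, text_keys, values, alpha=0.5):
    """Weight each value by image-to-key similarity, then blend into the feature."""
    weights = softmax([cosine(image_feat, k) for k in text_keys])
    mixed = [sum(w * v[d] for w, v in zip(weights, values))
             for d in range(len(image_feat))]
    refined = [(1 - alpha) * f + alpha * m for f, m in zip(image_feat, mixed)]
    return weights, refined

# Toy data: one key per interval prompt, with values equal to the keys.
prompts = ["0-50 people", "50-100 people", "100-500 people"]
text_keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_feat = [0.2, 0.9, 0.1]
weights, refined = match_and_refine(image_feat, text_keys, text_keys)
best = prompts[max(range(len(prompts)), key=weights.__getitem__)]
```

In a real system the keys would come from a text encoder applied to the interval prompts and the image feature from the matching image encoder; the blending step shows only the general shape of a key-value refinement.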
