🤖 AI Summary
In large-scale crowd scenes, per-instance annotation is prohibitively expensive, and enumeration-based counting methods scale poorly. Method: This paper proposes a “rough crowd counting” paradigm that requires only image-level count-interval labels (e.g., “50–100 people”), eliminating costly per-target annotations. It introduces a progressive coarse-to-fine estimation framework, ProgRoCC, built on CLIP, together with a vision-language matching adapter that refines visual features by optimizing cross-modal key-value pairs. Contribution/Results: Evaluated on three benchmark datasets (ShanghaiTech, UCF-QNRF, and JHU-CROWD), the method outperforms state-of-the-art semi-supervised and weakly supervised approaches while keeping inference fast. The result is a scalable, low-cost approach to crowd counting when detailed labels are unavailable.
📝 Abstract
As the number of individuals in a crowd grows, enumeration-based techniques become increasingly infeasible and their estimates increasingly unreliable. We propose instead an estimation-based version of the problem, which we label Rough Crowd Counting, that delivers better accuracy on the basis of training data that is easier to acquire. Rough crowd counting requires only rough annotations of the number of targets in an image, instead of the more traditional, and far more expensive, per-target annotations. We propose an approach to the rough crowd counting problem based on CLIP, termed ProgRoCC. Specifically, we introduce a progressive estimation learning strategy that determines the object count through a coarse-to-fine approach. This approach delivers answers quickly and outperforms the state of the art in semi- and weakly supervised crowd counting. In addition, we design a vision-language matching adapter that optimizes key-value pairs by mining effective matches between the two modalities to refine the visual features, thereby improving the final performance. Extensive experimental results on three widely adopted crowd counting datasets demonstrate the effectiveness of our method.
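The coarse-to-fine idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the ProgRoCC implementation: the helper names (`split_interval`, `coarse_to_fine_count`, `make_toy_score`) are hypothetical, and the toy scoring function only stands in for a learned image-to-text similarity such as one derived from CLIP. The sketch assumes that each candidate count interval can be scored against the image and that the best-scoring interval is repeatedly subdivided.

```python
from typing import Callable, List, Tuple

Interval = Tuple[int, int]  # half-open count range [low, high)


def split_interval(interval: Interval, parts: int) -> List[Interval]:
    """Split [low, high) into contiguous, non-empty sub-intervals."""
    low, high = interval
    width = high - low
    parts = min(parts, width)  # never create empty sub-intervals
    edges = [low + (width * i) // parts for i in range(parts + 1)]
    return [(edges[i], edges[i + 1]) for i in range(parts)]


def coarse_to_fine_count(
    score: Callable[[Interval], float],
    full_range: Interval,
    parts: int = 4,
    min_width: int = 1,
) -> Interval:
    """Keep the best-scoring sub-interval until it is at most min_width wide."""
    if parts < 2 or min_width < 1:
        raise ValueError("parts must be >= 2 and min_width must be >= 1")
    current = full_range
    while current[1] - current[0] > min_width:
        candidates = split_interval(current, parts)
        # Assumption: `score` plays the role of an image-to-interval match
        # score (for example, similarity to a prompt such as "50 to 100 people").
        current = max(candidates, key=score)
    return current


def make_toy_score(true_count: int) -> Callable[[Interval], float]:
    """Toy stand-in for a learned scorer: 0 if the interval holds the count,
    otherwise the negative distance to the nearest count inside the interval."""
    def score(interval: Interval) -> float:
        low, high = interval
        if low <= true_count < high:
            return 0.0
        return -float(min(abs(true_count - low), abs(true_count - (high - 1))))
    return score


if __name__ == "__main__":
    estimate = coarse_to_fine_count(make_toy_score(137), (0, 1024))
    print(estimate)
```

With a real model, stopping at a wider `min_width` would return a rough interval rather than an exact count, which matches the kind of label the rough counting setting assumes.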