CLIP-EBC: CLIP Can Count Accurately through Enhanced Blockwise Classification

📅 2024-03-14
🏛️ arXiv.org
📈 Citations: 3
Influential: 0

🤖 AI Summary
This work addresses the challenge of adapting CLIP, a model known for recognition tasks such as zero-shot image classification, to crowd counting, which is naturally a regression problem. We first propose Enhanced Blockwise Classification (EBC), a backbone-agnostic framework that replaces the bordering real-valued bins of earlier classification-based methods with integer-valued bins, reducing label ambiguity near bin boundaries, and adds a regression loss based on density maps to improve the predicted counts. Within this framework we introduce CLIP-EBC, the first fully CLIP-based model for crowd density estimation. EBC improves existing classification-based methods by up to 44.5% on UCF-QNRF. On the NWPU-Crowd test set, CLIP-EBC achieves state-of-the-art performance (MAE = 58.2, RMSE = 268.5), improving on the previous best method, STEERER, by 8.6% in MAE and 13.3% in RMSE.

📝 Abstract
We propose CLIP-EBC, the first fully CLIP-based model for accurate crowd density estimation. While the CLIP model has demonstrated remarkable success in addressing recognition tasks such as zero-shot image classification, its potential for counting has been largely unexplored due to the inherent challenges in transforming a regression problem, such as counting, into a recognition task. In this work, we investigate and enhance CLIP's ability to count, focusing specifically on the task of estimating crowd sizes from images. Existing classification-based crowd-counting frameworks have significant limitations, including the quantization of count values into bordering real-valued bins and the sole focus on classification errors. These practices result in label ambiguity near the shared borders and inaccurate prediction of count values. Hence, directly applying CLIP within these frameworks may yield suboptimal performance. To address these challenges, we first propose the Enhanced Blockwise Classification (EBC) framework. Unlike previous methods, EBC utilizes integer-valued bins, effectively reducing ambiguity near bin boundaries. Additionally, it incorporates a regression loss based on density maps to improve the prediction of count values. Within our backbone-agnostic EBC framework, we then introduce CLIP-EBC to fully leverage CLIP's recognition capabilities for this task. Extensive experiments demonstrate the effectiveness of EBC and the competitive performance of CLIP-EBC. Specifically, our EBC framework can improve existing classification-based methods by up to 44.5% on the UCF-QNRF dataset, and CLIP-EBC achieves state-of-the-art performance on the NWPU-Crowd test set, with an MAE of 58.2 and an RMSE of 268.5, representing improvements of 8.6% and 13.3% over the previous best method, STEERER. The code and weights are available at https://github.com/Yiming-M/CLIP-EBC.
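
The boundary ambiguity described in the abstract can be illustrated with a minimal sketch. The bin layouts below are hypothetical values chosen for illustration, not the bins used in the paper, and the function names are assumptions. The sketch assumes that the older scheme quantizes real-valued block counts into half-open bins that share borders, while the integer-valued scheme assigns each integer count to exactly one bin.

```python
import bisect

# Hypothetical bin layouts for per-block counts (illustrative values only).
# Bordering real-valued bins: [0, 0.5), [0.5, 1.5), [1.5, 2.5), [2.5, inf)
REAL_EDGES = [0.5, 1.5, 2.5]

# Integer-valued bins: {0}, {1}, {2}, {3, 4, ...}; None marks an open upper end.
INT_BINS = [(0, 0), (1, 1), (2, 2), (3, None)]


def real_bin_index(count: float) -> int:
    """Index of the half-open real-valued bin that contains `count`."""
    return bisect.bisect_right(REAL_EDGES, count)


def int_bin_index(count: int) -> int:
    """Index of the integer-valued bin that contains the integer `count`."""
    for idx, (lo, hi) in enumerate(INT_BINS):
        if count >= lo and (hi is None or count <= hi):
            return idx
    raise ValueError(f"no bin for count {count}")


# Two nearly identical real-valued counts straddle the shared border at 1.5
# and receive different class labels.
print(real_bin_index(1.49), real_bin_index(1.51))

# Integer counts never sit next to a shared border.
print(int_bin_index(1), int_bin_index(5))
```

In this toy layout, a small perturbation of a count near 1.5 flips its label under the real-valued bins, whereas an integer count always maps to a single unambiguous bin.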

Problem

Research questions and friction points this paper is trying to address.

Enhance CLIP's ability to count crowds accurately
Reduce label ambiguity in classification-based counting methods
Improve count prediction using density map regression loss

Innovation

Methods, ideas, or system contributions that make the work stand out.

Enhanced Blockwise Classification reduces bin boundary ambiguity
Incorporates a regression loss based on density maps for more accurate count prediction
Leverages CLIP's recognition capabilities for crowd counting
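
The idea of pairing a classification loss with a count regression loss can be sketched as follows. This is a minimal illustration under stated assumptions, not the loss defined in the paper: it assumes plain cross-entropy over bins, an L1 error between the expected block count and the ground-truth block count, and hypothetical representative values for each bin (`BIN_VALUES`). The function names and the `weight` parameter are also assumptions.

```python
import math

# Hypothetical representative count for each integer-valued bin {0}, {1}, {2}, {3+}.
BIN_VALUES = [0.0, 1.0, 2.0, 4.0]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_count(probs):
    """Predicted block count: probability-weighted average of the bin values."""
    return sum(p * v for p, v in zip(probs, BIN_VALUES))


def combined_loss(block_logits, block_labels, block_counts, weight=1.0):
    """Mean cross-entropy over blocks plus a weighted mean L1 count error.

    block_logits: one list of bin logits per block
    block_labels: ground-truth bin index per block
    block_counts: ground-truth count per block (a blockwise density map)
    """
    ce_total = 0.0
    l1_total = 0.0
    for logits, label, count in zip(block_logits, block_labels, block_counts):
        probs = softmax(logits)
        ce_total += -math.log(probs[label])
        l1_total += abs(expected_count(probs) - count)
    n = len(block_labels)
    return ce_total / n + weight * l1_total / n


# One block containing exactly one person (bin index 1, count 1).
uniform = combined_loss([[0.0, 0.0, 0.0, 0.0]], [1], [1.0])
confident = combined_loss([[-10.0, 10.0, -10.0, -10.0]], [1], [1.0])
print(uniform > confident)
```

The regression term penalizes a prediction in proportion to how far its expected count is from the true count, which a classification loss alone does not capture: cross-entropy treats a miss by one bin the same as a miss by many.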