🤖 AI Summary
To address low counting accuracy in complex scenes caused by scale variations and background clutter, this paper proposes a Transformer-based robust crowd counting network. The method introduces three key innovations: (1) an Adaptive Scale-Aware Module (ASAM), built on input-dependent deformable convolution (IDConv), to enable dynamic receptive-field modeling; (2) a Detail-Embedded Attention Block (DEAB) that fuses global and local self-attention to enhance head-structure perception; and (3) a Multi-level Feature Fusion Module (MFFM) to strengthen cross-scale semantic representation. Evaluated on four major benchmarks (ShanghaiTech Part_A, ShanghaiTech Part_B, NWPU-Crowd, and QNRF), the proposed approach achieves state-of-the-art performance, consistently outperforming existing methods in both Mean Absolute Error (MAE) and Mean Squared Error (MSE).
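The core idea behind IDConv is that the convolution's sampling grid is not fixed but predicted from the input itself, so the effective receptive field deforms with the head shapes and scales in the scene. The paper does not provide code, so the following is only a toy NumPy sketch of that mechanism for a single-channel map: an `offset_fn` (a stand-in for the learned offset branch) produces per-tap offsets from the input, and the 3x3 kernel samples the feature map at those shifted, fractional positions via bilinear interpolation. All names and shapes here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Sample feat (H, W) at fractional coords (y, x), zero outside the map."""
    H, W = feat.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    val = 0.0
    for dy in (0, 1):
        for dx in (0, 1):
            yy, xx = y0 + dy, x0 + dx
            if 0 <= yy < H and 0 <= xx < W:
                # Bilinear weights shrink linearly with distance to the corner.
                val += feat[yy, xx] * (1 - abs(y - yy)) * (1 - abs(x - xx))
    return val

def idconv2d(feat, weight, offset_fn):
    """Toy input-dependent deformable 3x3 convolution.

    feat      : (H, W) single-channel feature map
    weight    : (3, 3) kernel
    offset_fn : (feat, i, j) -> (9, 2) array of (dy, dx) offsets, one per
                kernel tap; because it reads the input, the sampling grid
                is input-dependent, which is the "ID" in IDConv.
    """
    H, W = feat.shape
    out = np.zeros_like(feat)
    base = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    for i in range(H):
        for j in range(W):
            offsets = offset_fn(feat, i, j)
            s = 0.0
            for k, (dy, dx) in enumerate(base):
                oy, ox = offsets[k]
                s += weight[dy + 1, dx + 1] * bilinear_sample(
                    feat, i + dy + oy, j + dx + ox)
            out[i, j] = s
    return out
```

With an offset function that returns all zeros, this reduces to an ordinary 3x3 convolution; a learned offset branch replaces it in the real module.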
📝 Abstract
Crowd counting, a key computer vision task, has emerged as a fundamental technology in crowd analysis and public safety management. However, challenges such as scale variations and complex backgrounds significantly impact counting accuracy. To mitigate these issues, this paper proposes a robust Transformer-based crowd counting network, termed RCCFormer, specifically designed for background suppression and scale awareness. The proposed method incorporates a Multi-level Feature Fusion Module (MFFM), which integrates features extracted at different stages of the backbone architecture, establishing a strong baseline that captures richer and more comprehensive feature representations than traditional baselines. Furthermore, the introduced Detail-Embedded Attention Block (DEAB) captures contextual information through global self-attention and local details through local attention, and fuses the two in a learnable manner. This enhances the model's ability to focus on foreground regions while effectively mitigating background noise interference. Additionally, we develop an Adaptive Scale-Aware Module (ASAM), with our novel Input-dependent Deformable Convolution (IDConv) as its fundamental building block. This module dynamically adapts to changes in head target shapes and scales, significantly improving the network's capability to accommodate large scale variations. The effectiveness of the proposed method is validated on the ShanghaiTech Part_A and Part_B, NWPU-Crowd, and QNRF datasets. The results demonstrate that our RCCFormer achieves state-of-the-art performance across all four datasets.
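The fusion step in MFFM can be illustrated with a minimal sketch: backbone stages produce feature maps at progressively coarser resolutions, and fusing them requires aligning all stages to a common spatial size before combining them along the channel dimension. The NumPy code below is an assumption-laden stand-in for that learned fusion, using nearest-neighbour upsampling and channel concatenation in place of the module's actual learned operations; function names and the stage shapes are illustrative only.

```python
import numpy as np

def upsample_nearest(feat, factor):
    """Nearest-neighbour upsampling of a (C, H, W) feature map."""
    return feat.repeat(factor, axis=1).repeat(factor, axis=2)

def fuse_multilevel(feats):
    """Fuse backbone features from several stages, finest first.

    feats: list of (C_i, H_i, W_i) arrays where each later stage has a
    smaller spatial size (here assumed to divide the finest size evenly).
    All stages are brought to the finest resolution and concatenated
    along channels, a simplified stand-in for the learned fusion in MFFM.
    """
    target_h = feats[0].shape[1]
    aligned = [upsample_nearest(f, target_h // f.shape[1]) for f in feats]
    return np.concatenate(aligned, axis=0)
```

For stages of shape (2, 8, 8), (4, 4, 4), and (8, 2, 2), the fused map has shape (14, 8, 8): every stage contributes its channels at the finest resolution, so later layers can draw on both fine spatial detail and coarse semantics.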