🤖 AI Summary
To address low counting accuracy in complex scenes caused by scale variations and background clutter, this paper proposes a Transformer-based robust crowd counting network. The method introduces three key innovations: (1) an Adaptive Scale-Aware Module (ASAM), built on input-dependent deformable convolution (IDConv), to enable dynamic receptive-field modeling; (2) a Detail-Embedded Attention Block (DEAB) that fuses global and local self-attention to enhance head-structure perception; and (3) a Multi-level Feature Fusion Module (MFFM) to strengthen cross-scale semantic representation. Evaluated on four major benchmarks (ShanghaiTech Part_A, ShanghaiTech Part_B, NWPU-Crowd, and QNRF), the proposed approach achieves state-of-the-art performance, consistently outperforming existing methods in both Mean Absolute Error (MAE) and Mean Squared Error (MSE).
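The core idea behind IDConv is that the convolution's sampling grid is not fixed but predicted from the input itself, so the effective receptive field deforms with the head shapes and scales in the scene. The paper does not provide code, so the following is only a toy NumPy sketch of that mechanism for a single-channel map: an `offset_fn` (a stand-in for the learned offset branch) produces per-tap offsets from the input, and the 3x3 kernel samples the feature map at those shifted, fractional positions via bilinear interpolation. All names and shapes here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Sample feat (H, W) at fractional coords (y, x), zero outside the map."""
    H, W = feat.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    val = 0.0
    for dy in (0, 1):
        for dx in (0, 1):
            yy, xx = y0 + dy, x0 + dx
            if 0 <= yy < H and 0 <= xx < W:
                # Bilinear weights shrink linearly with distance to the corner.
                val += feat[yy, xx] * (1 - abs(y - yy)) * (1 - abs(x - xx))
    return val

def idconv2d(feat, weight, offset_fn):
    """Toy input-dependent deformable 3x3 convolution.

    feat      : (H, W) single-channel feature map
    weight    : (3, 3) kernel
    offset_fn : (feat, i, j) -> (9, 2) array of (dy, dx) offsets, one per
                kernel tap; because it reads the input, the sampling grid
                is input-dependent, which is the "ID" in IDConv.
    """
    H, W = feat.shape
    out = np.zeros_like(feat)
    base = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    for i in range(H):
        for j in range(W):
            offsets = offset_fn(feat, i, j)
            s = 0.0
            for k, (dy, dx) in enumerate(base):
                oy, ox = offsets[k]
                s += weight[dy + 1, dx + 1] * bilinear_sample(
                    feat, i + dy + oy, j + dx + ox)
            out[i, j] = s
    return out
```

With an offset function that returns all zeros, this reduces to an ordinary 3x3 convolution; a learned offset branch replaces it in the real module.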
📝 Abstract
Crowd counting, a key computer vision task, has emerged as a fundamental technology in crowd analysis and public safety management. However, challenges such as scale variations and complex backgrounds significantly impact counting accuracy. To mitigate these issues, this paper proposes a robust Transformer-based crowd counting network, termed RCCFormer, specifically designed for background suppression and scale awareness. The proposed method incorporates a Multi-level Feature Fusion Module (MFFM), which integrates features extracted at different stages of the backbone architecture, establishing a strong baseline that captures richer and more comprehensive feature representations than traditional baselines. Furthermore, the introduced Detail-Embedded Attention Block (DEAB) captures contextual information through global self-attention and local details through local attention, and fuses the two in a learnable manner. This enhances the model's ability to focus on foreground regions while effectively mitigating background noise interference. Additionally, we develop an Adaptive Scale-Aware Module (ASAM), with our novel Input-dependent Deformable Convolution (IDConv) as its fundamental building block. This module dynamically adapts to changes in head target shapes and scales, significantly improving the network's capability to accommodate large scale variations. The effectiveness of the proposed method is validated on the ShanghaiTech Part_A and Part_B, NWPU-Crowd, and QNRF datasets. The results demonstrate that our RCCFormer achieves state-of-the-art performance across all four datasets.
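The fusion step in MFFM can be illustrated with a minimal sketch: backbone stages produce feature maps at progressively coarser resolutions, and fusing them requires aligning all stages to a common spatial size before combining them along the channel dimension. The NumPy code below is an assumption-laden stand-in for that learned fusion, using nearest-neighbour upsampling and channel concatenation in place of the module's actual learned operations; function names and the stage shapes are illustrative only.

```python
import numpy as np

def upsample_nearest(feat, factor):
    """Nearest-neighbour upsampling of a (C, H, W) feature map."""
    return feat.repeat(factor, axis=1).repeat(factor, axis=2)

def fuse_multilevel(feats):
    """Fuse backbone features from several stages, finest first.

    feats: list of (C_i, H_i, W_i) arrays where each later stage has a
    smaller spatial size (here assumed to divide the finest size evenly).
    All stages are brought to the finest resolution and concatenated
    along channels, a simplified stand-in for the learned fusion in MFFM.
    """
    target_h = feats[0].shape[1]
    aligned = [upsample_nearest(f, target_h // f.shape[1]) for f in feats]
    return np.concatenate(aligned, axis=0)
```

For stages of shape (2, 8, 8), (4, 4, 4), and (8, 2, 2), the fused map has shape (14, 8, 8): every stage contributes its channels at the finest resolution, so later layers can draw on both fine spatial detail and coarse semantics.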