🤖 AI Summary
This work addresses scale variation, occlusion, and high computational cost in dense crowd counting by proposing RepSFNet, a lightweight single-branch network. Built on a RepLK-ViT backbone, the method uses structural reparameterization to combine large-kernel convolutions with multi-scale features while avoiding attention mechanisms. It introduces a Concatenate Fusion module and incorporates ASPP and CAN for density-adaptive context modeling. The model is trained with a combined loss of mean squared error (MSE) and optimal transport (OT). Evaluated on the ShanghaiTech, NWPU, and UCF-QNRF datasets, RepSFNet achieves competitive accuracy while reducing inference latency by up to 34% relative to the compared methods, balancing precision against real-time deployment on edge devices.
📝 Abstract
Crowd counting remains challenging in variable-density scenes due to scale variations, occlusions, and the high computational cost of existing models. To address this, we propose RepSFNet (Reparameterized Single Fusion Network), a lightweight architecture designed for accurate, real-time crowd estimation. RepSFNet combines large-kernel convolutions with an efficient single-branch design. The architecture includes three components: (i) a RepLK-ViT backbone using large reparameterized kernels for efficient multi-scale feature extraction; (ii) a Feature Fusion module that integrates ASPP and CAN for robust, density-adaptive context modeling; and (iii) a Concatenate Fusion module to preserve spatial resolution and produce high-quality density maps. By avoiding attention mechanisms and multi-branch designs, RepSFNet reduces both parameters and FLOPs, improving runtime efficiency. The loss function combines Mean Squared Error (MSE) and Optimal Transport (OT), further improving count accuracy. Experiments on ShanghaiTech, NWPU, and UCF-QNRF show that RepSFNet delivers competitive accuracy with up to 34% lower inference latency compared to P2PNet, M-SFANet, M-SegNet, STEERER, and Gramformer, making it suitable for low-power edge computing.
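The idea behind reparameterized large kernels can be illustrated with a minimal sketch. It assumes a training-time block with a large-kernel branch and a parallel small-kernel branch whose outputs are summed; the abstract does not give RepSFNet's actual kernel sizes or branch layout, so the 7x7 and 3x3 sizes and the helper names `conv2d_same` and `merge_kernels` are hypothetical. Because convolution is linear, the small kernel can be zero-padded to the large size and added to it, so inference runs a single convolution. Practical implementations typically also fold batch normalization into the merged kernel, which is omitted here.

```python
import random

def conv2d_same(x, k):
    """Single-channel 2D cross-correlation with zero 'same' padding (odd square kernel)."""
    n, m, ks = len(x), len(x[0]), len(k)
    p = ks // 2
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            s = 0.0
            for a in range(ks):
                for b in range(ks):
                    r, c = i + a - p, j + b - p
                    if 0 <= r < n and 0 <= c < m:  # positions outside the image count as zero
                        s += x[r][c] * k[a][b]
            out[i][j] = s
    return out

def merge_kernels(k_large, k_small):
    """Add the small kernel into the centre of a copy of the large kernel."""
    off = (len(k_large) - len(k_small)) // 2
    merged = [row[:] for row in k_large]
    for a, row in enumerate(k_small):
        for b, v in enumerate(row):
            merged[a + off][b + off] += v
    return merged

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
x = rand_matrix(12, 12, rng)        # toy feature map
k_large = rand_matrix(7, 7, rng)    # assumed large-kernel branch
k_small = rand_matrix(3, 3, rng)    # assumed parallel small-kernel branch

# Training-time form: two branches, outputs summed.
y_large = conv2d_same(x, k_large)
y_small = conv2d_same(x, k_small)
two_branch = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(y_large, y_small)]

# Inference-time form: one convolution with the merged kernel.
k_merged = merge_kernels(k_large, k_small)
single_branch = conv2d_same(x, k_merged)

max_abs_diff = max(
    abs(a - b) for ra, rb in zip(two_branch, single_branch) for a, b in zip(ra, rb)
)
print(max_abs_diff)  # differs from zero only by floating-point rounding
```

The two forms produce the same output, which is why a reparameterized model can keep the training benefits of extra branches while paying for only one convolution per block at inference time.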