AI Summary
To address inaccurate multi-class object counting under dense occlusion, this paper proposes a multi-task framework based on density-map estimation. Methodologically, it introduces a class-focusing module to suppress inter-class interference and pioneers the incorporation of a region-aware loss into multi-class density estimation. Built on a Twins pyramid vision transformer backbone, the framework combines multi-scale decoding with a dedicated multi-class counting head and segmentation-guided auxiliary learning. Contributions include: (1) significantly improved counting accuracy in high-density, heavily occluded scenes; and (2) extension of density estimation to new application domains such as biodiversity monitoring. Experiments report mean absolute error (MAE) reductions of 33%, 43%, and 64% across the VisDrone and iSAID datasets. Cross-domain experiments further validate the model's strong generalization capability.
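To make the architecture concrete, below is a minimal PyTorch sketch of a multi-class counting head with segmentation-guided class gating, the general pattern the summary describes. This is an illustration under our own assumptions, not the authors' implementation: all names (`MultiClassCountingHead`, the 1x1-conv branches, the soft masking) are hypothetical, and the paper's Category Focus Module and region-aware loss are not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiClassCountingHead(nn.Module):
    """Hypothetical sketch: predict one density map per class from decoder
    features, then gate each map with a segmentation-derived class mask to
    suppress inter-class cross-talk (the role the paper assigns to its
    class-focusing / Category Focus Module)."""

    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        # One density channel per object class.
        self.density = nn.Conv2d(in_channels, num_classes, kernel_size=1)
        # Auxiliary segmentation logits: num_classes + 1 background channel.
        self.segmentation = nn.Conv2d(in_channels, num_classes + 1, kernel_size=1)

    def forward(self, features: torch.Tensor):
        density = F.relu(self.density(features))       # (B, C, H, W), non-negative
        seg_logits = self.segmentation(features)       # (B, C+1, H, W)
        # Soft per-class mask from the segmentation branch (drop background).
        class_mask = seg_logits.softmax(dim=1)[:, 1:]  # (B, C, H, W)
        focused = density * class_mask                 # suppress other-class activations
        # The count for each class is the spatial integral of its density map.
        counts = focused.sum(dim=(2, 3))               # (B, C)
        return focused, seg_logits, counts


# Usage with dummy decoder features, e.g. from a Twins-style pyramid backbone.
feats = torch.randn(2, 256, 96, 96)
head = MultiClassCountingHead(in_channels=256, num_classes=10)
density_maps, seg_logits, per_class_counts = head(feats)
print(per_class_counts.shape)  # torch.Size([2, 10])
```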
Abstract
Density map estimation can be used to count objects in dense and occluded scenes where discrete counting-by-detection methods fail. We propose a multi-category counting framework that leverages a Twins pyramid vision-transformer backbone and a specialised multi-class counting head built on a state-of-the-art multi-scale decoding approach. A two-task design adds a segmentation-based Category Focus Module that suppresses inter-category cross-talk at training time. Training and evaluation on the VisDrone and iSAID benchmarks demonstrate superior performance over prior multi-category crowd-counting approaches (MAE reductions of 33%, 43%, and 64%), and a comparison with YOLOv11 underscores the necessity of crowd-counting methods in dense scenes. The method's regional loss opens multi-class crowd counting to new domains, demonstrated through application to a biodiversity monitoring dataset, highlighting its capacity to inform conservation efforts and enable scalable ecological insight.
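For reference, the reported MAE figures follow the standard crowd-counting definition; a per-class form (notation ours, not taken from the paper) is:

```latex
% Per-class mean absolute error over N test images, where
% \hat{c}_{i,k} = \sum_{p} \hat{D}_{i,k}(p) is the predicted count for
% class k (spatial sum of its density map) and c_{i,k} the ground truth.
\mathrm{MAE}_k = \frac{1}{N} \sum_{i=1}^{N} \left| \hat{c}_{i,k} - c_{i,k} \right|
```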