🤖 AI Summary
This work addresses instance segmentation in dense crowd scenes, where datasets typically provide only point annotations and no instance masks. The authors propose Dense Point-to-Mask Optimization (DPMO), which combines the Segment Anything Model (SAM) with a Nearest Neighbor Exclusive Circle (NNEC) constraint to convert point annotations into dense instance masks; together with manual correction, this yields mask annotations for existing crowd datasets. For prediction, they introduce a Reinforced Point Selection (RPS) framework, trained with Group Relative Policy Optimization (GRPO), that selects the best point from candidates sampled from an initial point prediction. The method reports state-of-the-art crowd instance segmentation performance on the ShanghaiTech, UCF-QNRF, JHU-CROWD++, and NWPU-Crowd benchmarks. The authors also design mask-supervised loss functions that improve counting performance across different models.
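The summary does not spell out how the nearest-neighbor exclusive-circle constraint is computed. One plausible reading, offered here only as an illustrative sketch, is that each annotated point gets a circle whose radius is a fraction of the distance to its nearest neighboring point, and a candidate mask is clipped to that circle so it cannot spill onto an adjacent person. The function names, the `scale=0.5` factor, and the pixel-set mask representation below are assumptions, not details taken from the paper.

```python
import math


def nnec_radii(points, scale=0.5):
    """Return one radius per point: scale * distance to its nearest neighbor.

    Assumption: with scale <= 0.5, circles of any two points cannot overlap
    (they can at most touch), which is one way to make them "exclusive".
    """
    radii = []
    for i, (xi, yi) in enumerate(points):
        nearest = min(
            (math.hypot(xi - xj, yi - yj)
             for j, (xj, yj) in enumerate(points) if j != i),
            default=math.inf,  # a lone point has no neighbor to exclude
        )
        radii.append(scale * nearest)
    return radii


def clip_mask_to_circle(mask_pixels, center, radius):
    """Keep only the (x, y) mask pixels that lie inside the point's circle."""
    cx, cy = center
    return {(x, y) for (x, y) in mask_pixels
            if math.hypot(x - cx, y - cy) <= radius}


# Hypothetical head annotations (x, y) in pixels.
points = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0)]
radii = nnec_radii(points)

# A toy "mask" for the first point that leaks toward its neighbor.
leaky_mask = {(0, 0), (1, 0), (3, 0)}
clipped = clip_mask_to_circle(leaky_mask, points[0], radii[0])
```

In a full pipeline the mask would come from a promptable segmenter such as SAM and would normally be a 2D array rather than a set of pixels; the set is used here only to keep the sketch dependency-free.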
📝 Abstract
Crowd instance segmentation is a crucial task with a wide range of applications, including surveillance and transportation. Currently, point labels are common in crowd datasets, while region labels (e.g., boxes) are rare and inaccurate. Masks obtained through segmentation help to improve the accuracy of region labels and to resolve the correspondence between individual location coordinates and crowd density maps. However, directly applying popular large foundation models such as SAM does not yield ideal results in dense crowds. To this end, we first propose Dense Point-to-Mask Optimization (DPMO), which integrates SAM with the Nearest Neighbor Exclusive Circle (NNEC) constraint to generate dense instance segmentation from point annotations. With DPMO and manual correction, we obtain mask annotations from the existing point annotations of traditional crowd datasets. Then, to predict instance segmentation in dense crowds, we propose a Reinforced Point Selection (RPS) framework trained with Group Relative Policy Optimization (GRPO), which selects the best predicted point from candidates sampled from the initial point prediction. Through extensive experiments, we achieve state-of-the-art crowd instance segmentation performance on the ShanghaiTech, UCF-QNRF, JHU-CROWD++, and NWPU-Crowd datasets. Furthermore, we design new mask-supervised loss functions that boost counting performance across different models, demonstrating the significant role of mask annotations in enhancing counting accuracy.
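The group-relative part of GRPO can be illustrated with a small sketch: a group of candidates is sampled, each receives a scalar reward, and each candidate's advantage is its reward normalized by the group's mean and standard deviation. The abstract does not state which reward RPS uses, so the reward values below are hypothetical (for instance, a mask-quality score per candidate point), and the helper names and the use of the population standard deviation are assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps).

    Assumption: population standard deviation; some implementations use the
    sample standard deviation instead.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def select_best(candidates, rewards):
    """Return the candidate with the highest reward (ties: first one wins)."""
    best_index = max(range(len(candidates)), key=lambda i: rewards[i])
    return candidates[best_index]


# Hypothetical candidate points sampled for one person.
candidates = [(10, 12), (11, 12), (10, 14)]
# Hypothetical rewards, e.g. a quality score of the mask each point yields.
rewards = [0.2, 0.8, 0.5]

advantages = group_relative_advantages(rewards)
best_point = select_best(candidates, rewards)
```

Candidates scoring above the group mean get positive advantages and are reinforced during training, while those below the mean are discouraged; at inference time only the selection step is needed.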