🤖 AI Summary
This work addresses the scarcity of high-quality, multi-task annotated datasets for indoor high-density crowd scenarios, a gap that hinders research on human detection, instance segmentation, and multi-object tracking. We introduce a large-scale video dataset captured across four campus environments, comprising 9,913 frames, of which 2,552 are sequentially labeled for multi-object tracking and 620 are dedicated to evaluating automatic annotation. For the first time in real-world, high-density indoor settings, we provide human-verified instance-level segmentation masks and systematically evaluate SAM-family models (SAM3, GroundingSAM, and EfficientGroundingSAM) as automatic annotators. Baseline results are established using the YOLOv8n, YOLOv26n, and RT-DETR-L detectors combined with the ByteTrack, BoT-SORT, and OC-SORT trackers. Experiments show that the ACS-EC subset, characterized by extreme crowd density and small object scales, is the most challenging, making it a robust benchmark for future algorithm development.
📝 Abstract
Understanding human behaviour in crowded indoor environments is central to surveillance, smart buildings, and human-robot interaction, yet existing datasets rarely capture real-world indoor complexity at scale. We introduce IndoorCrowd, a multi-scene dataset for indoor human detection, instance segmentation, and multi-object tracking, collected across four campus locations (ACS-EC, ACS-EG, IE-Central, R-Central). It comprises $31$ videos ($9{,}913$ frames at $5$ fps) with human-verified, per-instance segmentation masks. A $620$-frame control subset benchmarks three foundation-model auto-annotators (SAM3, GroundingSAM, and EfficientGroundingSAM) against human labels using Cohen's $κ$, AP, precision, recall, and mask IoU. A further $2{,}552$-frame subset supports multi-object tracking with continuous identity tracks in MOTChallenge format. We establish detection, segmentation, and tracking baselines using YOLOv8n, YOLOv26n, and RT-DETR-L paired with ByteTrack, BoT-SORT, and OC-SORT. Per-scene analysis reveals substantial difficulty variation driven by crowd density, object scale, and occlusion: ACS-EC, with $79.3\%$ dense frames and a mean instance scale of $60.8$ px, is the most challenging scene. The project page is available at https://sheepseb.github.io/IndoorCrowd/.
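Two of the agreement metrics used to compare auto-annotations against human labels, mask IoU and Cohen's $κ$, can be computed per pixel on binary masks. The sketch below is illustrative only (the function names and toy masks are ours, not part of the dataset toolkit), assuming masks are boolean NumPy arrays of identical shape:

```python
import numpy as np

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection-over-union of two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union else 1.0

def cohens_kappa(a: np.ndarray, b: np.ndarray) -> float:
    """Cohen's kappa for two binary per-pixel labelings."""
    a, b = a.ravel().astype(bool), b.ravel().astype(bool)
    po = float(np.mean(a == b))                 # observed agreement
    pa, pb = a.mean(), b.mean()                 # marginal positive rates
    pe = pa * pb + (1 - pa) * (1 - pb)          # chance agreement
    return (po - pe) / (1 - pe) if pe < 1 else 1.0

# Toy example: a 4x4 "human" mask and a "model" mask missing one pixel.
human = np.zeros((4, 4), dtype=bool)
human[1:3, 1:3] = True
model = human.copy()
model[1, 1] = False

print(round(mask_iou(human, model), 3))      # 0.75
print(round(cohens_kappa(human, model), 3))  # 0.818
```

Note that $κ$ corrects the raw pixel agreement for chance, which matters here because background pixels dominate crowded frames and would otherwise inflate agreement scores.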