IndoorCrowd: A Multi-Scene Dataset for Human Detection, Segmentation, and Tracking with an Automated Annotation Pipeline

📅 2026-04-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the scarcity of high-quality, multi-task annotated datasets for indoor high-density crowd scenarios, which hinders research in human detection, instance segmentation, and multi-object tracking. We introduce a large-scale video dataset captured across four campus environments, comprising 9,913 frames, including 2,552 sequentially labeled frames for multi-object tracking and 620 frames dedicated to automatic annotation evaluation. For the first time in real-world, high-density indoor settings, we provide human-verified instance-level segmentation masks and systematically evaluate the performance of SAM-family models—SAM, GroundingSAM, and EfficientGroundingSAM—in automatic annotation. Baseline results are established using YOLOv8n/v26n and RT-DETR-L detectors combined with ByteTrack, BoT-SORT, and OC-SORT trackers. Experiments reveal that the ACS-EC subset, characterized by extreme crowd density and small object scales, presents the greatest challenge, offering a robust benchmark for future algorithm development.
📝 Abstract
Understanding human behaviour in crowded indoor environments is central to surveillance, smart buildings, and human-robot interaction, yet existing datasets rarely capture real-world indoor complexity at scale. We introduce IndoorCrowd, a multi-scene dataset for indoor human detection, instance segmentation, and multi-object tracking, collected across four campus locations (ACS-EC, ACS-EG, IE-Central, R-Central). It comprises $31$ videos ($9{,}913$ frames at $5$ fps) with human-verified, per-instance segmentation masks. A $620$-frame control subset benchmarks three foundation-model auto-annotators (SAM3, GroundingSAM, and EfficientGroundingSAM) against human labels using Cohen's $κ$, AP, precision, recall, and mask IoU. A further $2{,}552$-frame subset supports multi-object tracking with continuous identity tracks in MOTChallenge format. We establish detection, segmentation, and tracking baselines using YOLOv8n, YOLOv26n, and RT-DETR-L paired with ByteTrack, BoT-SORT, and OC-SORT. Per-scene analysis reveals substantial difficulty variation driven by crowd density, scale, and occlusion: ACS-EC, with $79.3\%$ dense frames and a mean instance scale of $60.8$ px, is the most challenging scene. The project page is available at https://sheepseb.github.io/IndoorCrowd/.
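The control-subset evaluation compares auto-generated masks against human labels using, among other metrics, mask IoU and Cohen's $κ$. As an illustrative sketch only (the function names and the per-pixel agreement formulation below are our own assumptions, not the paper's implementation):

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-union of two binary instance masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union else 1.0

def cohens_kappa(pred: np.ndarray, gt: np.ndarray) -> float:
    """Cohen's kappa for per-pixel foreground/background agreement."""
    pred = pred.astype(bool).ravel()
    gt = gt.astype(bool).ravel()
    p_obs = float(np.mean(pred == gt))          # observed agreement
    p_fg = pred.mean() * gt.mean()              # chance: both foreground
    p_bg = (1.0 - pred.mean()) * (1.0 - gt.mean())  # chance: both background
    p_exp = p_fg + p_bg                          # total chance agreement
    return (p_obs - p_exp) / (1.0 - p_exp) if p_exp < 1.0 else 1.0
```

For example, a 2×2 predicted patch against a 2×3 ground-truth patch overlaps in 4 of 6 pixels, giving an IoU of about 0.67; κ additionally discounts the agreement expected by chance on the (mostly background) frame.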
Problem

Research questions and friction points this paper is trying to address.

indoor crowd
human detection
instance segmentation
multi-object tracking
dataset
Innovation

Methods, ideas, or system contributions that make the work stand out.

automated annotation pipeline
multi-scene indoor dataset
instance segmentation
multi-object tracking
foundation models
Sebastian-Ion Nae
National University of Science and Technology Politehnica Bucharest, Romania
Radu Moldoveanu
National University of Science and Technology Politehnica Bucharest, Romania; Expleo, Romania
Alexandra Stefania Ghita
National University of Science and Technology Politehnica Bucharest, Romania
Adina Magda Florea
Professor of Computer Science, University Politehnica of Bucharest, Academy of Romanian Scientists
Artificial Intelligence · Machine Learning · Ambient Assisted Living · Academy of Romanian Scientists