Crowded Video Individual Counting Informed by Social Grouping and Spatial-Temporal Displacement Priors

📅 2026-01-03
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses the challenge of video-based individual counting in crowded scenes—such as subway commutes—where existing methods suffer from inaccurate inter-frame pedestrian correspondence. To overcome this limitation, the authors propose the OMAN++ model, which innovatively replaces conventional one-to-one matching with a one-to-many matching strategy. The approach integrates social grouping cues and spatiotemporal displacement priors through a dedicated displacement prior injector and an implicit context generator, enabling joint optimization of matching, feature extraction, and training. Extensive experiments demonstrate that the proposed method outperforms state-of-the-art techniques across multiple benchmark datasets and achieves a 38.12% reduction in counting error on the newly introduced WuhanMetroCrowd dataset.

📝 Abstract
Video Individual Counting (VIC) is a recently introduced task aiming to estimate pedestrian flux from a video. It extends Video Crowd Counting (VCC) beyond the per-frame pedestrian count. In contrast to VCC, which learns to count pedestrians across frames, VIC must identify co-existent pedestrians between frames, which turns out to be a correspondence problem. Existing VIC approaches, however, can underperform in congested scenes such as metro commuting. To address this, we build WuhanMetroCrowd, one of the first VIC datasets that characterize crowded, dynamic pedestrian flows. It features sparse-to-dense density levels, short-to-long video clips, slow-to-fast flow variations, front-to-back appearance changes, and light-to-heavy occlusions. To better adapt VIC approaches to crowds, we rethink the nature of VIC and recognize two informative priors: i) the social grouping prior, which indicates that pedestrians tend to gather in groups, and ii) the spatial-temporal displacement prior, which informs us that an individual cannot physically teleport. The former inspires us to relax the standard one-to-one (O2O) matching used by VIC to one-to-many (O2M) matching, implemented by an implicit context generator and an O2M matcher; the latter facilitates the design of a displacement prior injector, which strengthens not only O2M matching but also feature extraction and model training. These designs jointly form a novel and strong VIC baseline, OMAN++. Extensive experiments show that OMAN++ not only outperforms state-of-the-art VIC baselines on the standard SenseCrowd, CroHD, and MovingDroneCrowd benchmarks, but also shows a clear advantage in crowded scenes, with a 38.12% error reduction on our WuhanMetroCrowd dataset. Code, data, and pretrained models are available at https://github.com/tiny-smart/OMAN.
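The two priors named in the abstract can be illustrated with a small sketch. This is not the OMAN++ implementation (which learns the matcher jointly with feature extraction); it only demonstrates the idea: a displacement prior discards physically implausible correspondences between frames, and a one-to-many relaxation lets one detection match a small group of nearby candidates rather than exactly one. All function names and thresholds (`max_disp`, `group_radius`) here are illustrative assumptions.

```python
import numpy as np

def o2m_match(pts_t, pts_t1, max_disp=50.0, group_radius=15.0):
    """Match each point in frame t to a *set* of plausible points in t+1.

    pts_t:  (N, 2) array of head positions in frame t (pixels).
    pts_t1: (M, 2) array of head positions in frame t+1.
    Returns a list where matches[i] holds candidate indices in pts_t1.
    """
    # Pairwise Euclidean displacement between the two frames.
    disp = np.linalg.norm(pts_t[:, None, :] - pts_t1[None, :, :], axis=-1)
    matches = []
    for i in range(len(pts_t)):
        # Displacement prior: an individual cannot teleport, so drop
        # correspondences farther than max_disp pixels away.
        feasible = np.where(disp[i] <= max_disp)[0]
        if feasible.size == 0:
            matches.append([])  # no plausible match (e.g., an exiting pedestrian)
            continue
        # Grouping-style O2M relaxation: keep the nearest candidate plus any
        # feasible neighbour within group_radius of it.
        best = feasible[np.argmin(disp[i, feasible])]
        group = [int(j) for j in feasible
                 if np.linalg.norm(pts_t1[j] - pts_t1[best]) <= group_radius]
        matches.append(sorted(group))
    return matches

# Toy example: 3 pedestrians in frame t, 4 detections in frame t+1.
pts_t = np.array([[10.0, 10.0], [100.0, 100.0], [300.0, 300.0]])
pts_t1 = np.array([[15.0, 12.0], [22.0, 14.0], [105.0, 98.0], [500.0, 500.0]])
print(o2m_match(pts_t, pts_t1))  # → [[0, 1], [2], []]
```

In OMAN++ the hard thresholds above are replaced by learned components (the displacement prior injector and implicit context generator), so matching, feature extraction, and training are optimized jointly rather than hand-tuned.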
Problem

Research questions and friction points this paper is trying to address:
Video Individual Counting · crowded scenes · pedestrian correspondence · occlusion · dynamic pedestrian flows

Innovation

Methods, ideas, or system contributions that make the work stand out:
social grouping prior · spatial-temporal displacement prior · one-to-many matching · video individual counting · crowd counting
Authors

Hao Lu
Associate Professor, Huazhong University of Science and Technology
Computer Vision, Deep Learning, Plant Phenotyping

Xuhui Zhu
State Key Laboratory of Multispectral Information Intelligent Processing Technology; School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan 430074, China

Wenjing Zhang
State Key Laboratory of Multispectral Information Intelligent Processing Technology; School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan 430074, China

Yanan Li
Hubei Key Laboratory of Intelligent Robot; School of Computer Science & Engineering Artificial Intelligence, Wuhan Institute of Technology, Wuhan 430205, China

Xiang Bai
Huazhong University of Science and Technology (HUST)
Computer Vision, OCR