Crowded Video Individual Counting Informed by Social Grouping and Spatial-Temporal Displacement Priors

📅 2026-01-03
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses the challenge of video-based individual counting in crowded scenes—such as subway commutes—where existing methods suffer from inaccurate inter-frame pedestrian correspondence. To overcome this limitation, the authors propose the OMAN++ model, which innovatively replaces conventional one-to-one matching with a one-to-many matching strategy. The approach integrates social grouping cues and spatiotemporal displacement priors through a dedicated displacement prior injector and an implicit context generator, enabling joint optimization of matching, feature extraction, and training. Extensive experiments demonstrate that the proposed method outperforms state-of-the-art techniques across multiple benchmark datasets and achieves a 38.12% reduction in counting error on the newly introduced WuhanMetroCrowd dataset.

📝 Abstract
Video Individual Counting (VIC) is a recently introduced task aiming to estimate pedestrian flux from a video. It extends Video Crowd Counting (VCC) beyond the per-frame pedestrian count. In contrast to VCC, which learns to count pedestrians across frames, VIC must identify co-existent pedestrians between frames, which turns out to be a correspondence problem. Existing VIC approaches, however, can underperform in congested scenes such as metro commuting. To address this, we build WuhanMetroCrowd, one of the first VIC datasets that characterize crowded, dynamic pedestrian flows. It features sparse-to-dense density levels, short-to-long video clips, slow-to-fast flow variations, front-to-back appearance changes, and light-to-heavy occlusions. To better adapt VIC approaches to crowds, we rethink the nature of VIC and recognize two informative priors: i) the social grouping prior, which indicates that pedestrians tend to gather in groups, and ii) the spatial-temporal displacement prior, which informs us that an individual cannot physically teleport. The former inspires us to relax the standard one-to-one (O2O) matching used by VIC to one-to-many (O2M) matching, implemented by an implicit context generator and an O2M matcher; the latter facilitates the design of a displacement prior injector, which strengthens not only O2M matching but also feature extraction and model training. These designs jointly form a novel and strong VIC baseline, OMAN++. Extensive experiments show that OMAN++ not only outperforms state-of-the-art VIC baselines on the standard SenseCrowd, CroHD, and MovingDroneCrowd benchmarks, but also shows a clear advantage in crowded scenes, with a 38.12% error reduction on our WuhanMetroCrowd dataset. Code, data, and pretrained models are available at https://github.com/tiny-smart/OMAN.
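The two priors named in the abstract can be illustrated with a small sketch. This is not the OMAN++ implementation (which learns the matcher jointly with feature extraction); it only demonstrates the idea: a displacement prior discards physically implausible correspondences between frames, and a one-to-many relaxation lets one detection match a small group of nearby candidates rather than exactly one. All function names and thresholds (`max_disp`, `group_radius`) here are illustrative assumptions.

```python
import numpy as np

def o2m_match(pts_t, pts_t1, max_disp=50.0, group_radius=15.0):
    """Match each point in frame t to a *set* of plausible points in t+1.

    pts_t:  (N, 2) array of head positions in frame t (pixels).
    pts_t1: (M, 2) array of head positions in frame t+1.
    Returns a list where matches[i] holds candidate indices in pts_t1.
    """
    # Pairwise Euclidean displacement between the two frames.
    disp = np.linalg.norm(pts_t[:, None, :] - pts_t1[None, :, :], axis=-1)
    matches = []
    for i in range(len(pts_t)):
        # Displacement prior: an individual cannot teleport, so drop
        # correspondences farther than max_disp pixels away.
        feasible = np.where(disp[i] <= max_disp)[0]
        if feasible.size == 0:
            matches.append([])  # no plausible match (e.g., an exiting pedestrian)
            continue
        # Grouping-style O2M relaxation: keep the nearest candidate plus any
        # feasible neighbour within group_radius of it.
        best = feasible[np.argmin(disp[i, feasible])]
        group = [int(j) for j in feasible
                 if np.linalg.norm(pts_t1[j] - pts_t1[best]) <= group_radius]
        matches.append(sorted(group))
    return matches

# Toy example: 3 pedestrians in frame t, 4 detections in frame t+1.
pts_t = np.array([[10.0, 10.0], [100.0, 100.0], [300.0, 300.0]])
pts_t1 = np.array([[15.0, 12.0], [22.0, 14.0], [105.0, 98.0], [500.0, 500.0]])
print(o2m_match(pts_t, pts_t1))  # → [[0, 1], [2], []]
```

In OMAN++ the hard thresholds above are replaced by learned components (the displacement prior injector and implicit context generator), so matching, feature extraction, and training are optimized jointly rather than hand-tuned.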
Problem

Research questions and friction points this paper is trying to address:
Video Individual Counting · crowded scenes · pedestrian correspondence · occlusion · dynamic pedestrian flows

Innovation

Methods, ideas, or system contributions that make the work stand out:
social grouping prior · spatial-temporal displacement prior · one-to-many matching · video individual counting · crowd counting
Authors

Hao Lu
Associate Professor, Huazhong University of Science and Technology
Computer Vision, Deep Learning, Plant Phenotyping

Xuhui Zhu
State Key Laboratory of Multispectral Information Intelligent Processing Technology; School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan 430074, China

Wenjing Zhang
State Key Laboratory of Multispectral Information Intelligent Processing Technology; School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan 430074, China

Yanan Li
Hubei Key Laboratory of Intelligent Robot; School of Computer Science & Engineering Artificial Intelligence, Wuhan Institute of Technology, Wuhan 430205, China

Xiang Bai
Huazhong University of Science and Technology (HUST)
Computer Vision, OCR