🤖 AI Summary
This work addresses the challenge of accurately counting and tracking dense crowds in complex dynamic scenes, where existing methods that rely on fixed cameras often fail. To this end, we propose a framework tailored for crowd analysis in videos captured by moving drones, accompanied by MovingDroneCrowd++, a large-scale dataset spanning diverse altitudes, viewpoints, and lighting conditions. We introduce GD3A, a counting method that bypasses explicit localization: it matches pixel-level pedestrian descriptors across consecutive frames via optimal transport with adaptive dustbin scores, and uses these matches to decompose each global density map into shared, inflow, and outflow components. Building on these matches, DVTrack converts descriptor-level correspondences into instance-level associations through a descriptor voting mechanism for tracking. Experiments demonstrate that our approach reduces counting error by 47.4% and improves tracking performance by 39.2% in dense and highly dynamic scenarios, significantly outperforming current state-of-the-art methods.
📝 Abstract
Counting and tracking dense crowds in large-scale scenes is highly challenging, yet existing methods mainly rely on datasets captured by fixed cameras, which provide limited spatial coverage and are inadequate for large-scale dense crowd analysis. To address this limitation, we propose a flexible solution using moving drones to capture videos and perform video-level crowd counting and tracking of unique pedestrians across entire scenes. We introduce MovingDroneCrowd++, the largest video-level dataset for dense crowd counting and tracking captured by moving drones, covering diverse and complex conditions with varying flight altitudes, camera angles, and illumination. Existing methods fail to achieve satisfactory performance on this dataset. To address this, we propose GD3A (Global Density Map Decomposition via Descriptor Association), a density map-based video individual counting method that avoids explicit localization. GD3A establishes pixel-level correspondences between pedestrian descriptors across consecutive frames via optimal transport with an adaptive dustbin score, enabling the decomposition of global density maps into shared, inflow, and outflow components. Building on this framework, we further introduce DVTrack, which converts descriptor-level matching into instance-level associations through a descriptor voting mechanism for pedestrian tracking. Experimental results show that our methods significantly outperform existing approaches under dense crowds and complex motion, reducing counting error by 47.4% and improving tracking performance by 39.2%.
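The pipeline the abstract describes can be sketched at a high level as follows. This is a minimal illustration, not the paper's implementation: it uses a Sinkhorn-style optimal transport solver with an extra "dustbin" row and column (in the spirit of SuperGlue-style matchers) to absorb unmatched descriptors, where the paper instead learns an adaptive dustbin score. All function names, array shapes, and the fixed `dustbin` value are illustrative assumptions.

```python
import numpy as np

def _logsumexp(x, axis):
    """Numerically stable log-sum-exp along an axis."""
    m = x.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def sinkhorn_with_dustbin(scores, dustbin=0.0, iters=100):
    """Match m frame-t descriptors to n frame-(t+1) descriptors.

    The score matrix is augmented with a dustbin row/column so that
    descriptors with no counterpart (outflow/inflow) can dump their mass
    there. Returns the (m+1) x (n+1) transport plan.
    """
    m, n = scores.shape
    aug = np.full((m + 1, n + 1), float(dustbin))
    aug[:m, :n] = scores
    # Marginals: each real descriptor carries unit mass; each dustbin can
    # absorb every descriptor from the opposite frame.
    log_mu = np.log(np.concatenate([np.ones(m), [n]]))
    log_nu = np.log(np.concatenate([np.ones(n), [m]]))
    u, v = np.zeros(m + 1), np.zeros(n + 1)
    for _ in range(iters):  # Sinkhorn iterations in log space
        u = log_mu - _logsumexp(aug + v[None, :], axis=1)
        v = log_nu - _logsumexp(aug + u[:, None], axis=0)
    return np.exp(aug + u[:, None] + v[None, :])

def decompose_masses(plan):
    """Split each descriptor's unit mass into shared / inflow / outflow.

    These per-descriptor fractions are what would weight the density map
    decomposition described in the abstract.
    """
    m, n = plan.shape[0] - 1, plan.shape[1] - 1
    shared = plan[:m, :n].sum(axis=1)  # matched mass of frame-t descriptors
    outflow = plan[:m, n]              # frame-t mass sent to the dustbin
    inflow = plan[m, :n]               # frame-(t+1) mass with no predecessor
    return shared, inflow, outflow

def descriptor_vote(ids_t, ids_t1, matches):
    """Turn pixel-level matches into instance-level associations by voting.

    ids_t / ids_t1 map each descriptor index to an instance id (-1 = none);
    matches is a list of (frame-t index, frame-(t+1) index) pairs. Each
    matched pair votes for an id pair; the majority wins per instance.
    """
    votes = {}
    for i, j in matches:
        a, b = ids_t[i], ids_t1[j]
        if a < 0 or b < 0:
            continue
        votes[(a, b)] = votes.get((a, b), 0) + 1
    best = {}
    for (a, b), c in votes.items():
        if a not in best or c > best[a][1]:
            best[a] = (b, c)
    return {a: b for a, (b, _) in best.items()}
```

Here the dustbin marginals let the total transported mass stay balanced even when the two frames contain different numbers of people, which is what makes the shared/inflow/outflow split well defined.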