Multi-view Crowd Tracking Transformer with View-Ground Interactions Under Large Real-World Scenes

📅 2026-04-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing approaches struggle with large-scale, long-duration, and highly occluded multi-view crowd tracking in real-world scenes. This work proposes MVTrackTrans, a Transformer-based architecture that uses a view-ground interaction mechanism to model the relationship between camera views and the ground plane, enabling end-to-end multi-view trajectory association. To support research in this area, the authors also collect, label, and release two large real-world datasets, MVCrowdTrack and CityTrack. On these two datasets, the proposed method outperforms existing approaches, which the authors attribute to a model design suited to large scenes, moving multi-view crowd tracking closer to practical deployment.

📝 Abstract

Multi-view crowd tracking estimates each person's trajectory on the ground plane of the scene. Recent work mainly relies on CNN-based multi-view crowd tracking architectures, most of which are evaluated and compared on relatively small datasets such as Wildtrack and MultiviewX. Because these two datasets are collected in small scenes and contain only tens of frames in the evaluation stage, it is difficult to apply current methods to real-world applications, where scenes are larger and occlusion is more complicated. In this paper, we propose a Transformer-based multi-view crowd tracking model, MVTrackTrans, which adopts interactions between camera views and the ground plane for enhanced multi-view tracking performance. In addition, for better evaluation, we collect and label two large real-world multi-view tracking datasets, MVCrowdTrack and CityTrack, which cover a much larger scene over a longer time period. Compared with existing methods on these two new datasets, the proposed MVTrackTrans model achieves better performance, demonstrating the advantages of the model design in dealing with large scenes. We believe the proposed datasets and model will push the frontiers of the task toward more practical scenarios. The datasets and code are available at: https://github.com/zqyq/MVTrackTrans.
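
To make the idea of relating camera views to the ground plane concrete, here is a minimal sketch of the standard geometry that multi-view ground-plane tracking builds on: mapping a pixel in each camera to ground coordinates with a planar homography, then combining the per-view estimates. This is not the MVTrackTrans model or its view-ground interaction mechanism, which the abstract does not detail. The function names (project_to_ground, fuse_views), the homography values, and the pixel coordinates are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple

Matrix3 = Sequence[Sequence[float]]
Point = Tuple[float, float]


def project_to_ground(H: Matrix3, pixel: Point) -> Point:
    """Map an image pixel (u, v) to ground-plane (x, y) with a 3x3 homography H."""
    u, v = pixel
    x = H[0][0] * u + H[0][1] * v + H[0][2]
    y = H[1][0] * u + H[1][1] * v + H[1][2]
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    if abs(w) < 1e-12:
        raise ValueError("pixel maps to a point at infinity on the ground plane")
    # Divide by the homogeneous coordinate to get Euclidean ground coordinates.
    return (x / w, y / w)


def fuse_views(estimates: List[Point]) -> Point:
    """Average per-view ground estimates (a simple stand-in for learned fusion)."""
    if not estimates:
        raise ValueError("need at least one view")
    n = len(estimates)
    return (sum(p[0] for p in estimates) / n, sum(p[1] for p in estimates) / n)


# Hypothetical image-to-ground homographies for two cameras (assumed values).
H_cam1 = [[0.01, 0.0, -3.0], [0.0, 0.01, -2.0], [0.0, 0.0, 1.0]]
H_cam2 = [[0.02, 0.0, -5.0], [0.0, 0.02, -4.4], [0.0, 0.0, 1.0]]

# Hypothetical foot-point detections of the same person in each view.
g1 = project_to_ground(H_cam1, (500.0, 400.0))
g2 = project_to_ground(H_cam2, (350.0, 320.0))
fused = fuse_views([g1, g2])
print(g1, g2, fused)
```

With these assumed values, both views place the person at roughly the same ground location, which is the kind of cross-view agreement a multi-view tracker can exploit when one view is occluded.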

Problem

Research questions and friction points this paper is trying to address.

multi-view crowd tracking

large real-world scenes

occlusion

tracking trajectories

ground plane

Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer

multi-view crowd tracking

view-ground interaction