🤖 AI Summary
In multi-object tracking, multi-camera viewpoint transformations induce geometric distortions in bird's-eye view (BEV) projections, severely degrading pedestrian appearance feature robustness and cross-view identity consistency. To address this, we propose an end-to-end trainable early-fusion multi-view tracking framework: first, geometric calibration and BEV feature alignment mitigate viewpoint-induced distortion; second, we embed a cross-frame cross-attention mechanism within the early-fusion pipeline to enable adaptive spatiotemporal and inter-view feature propagation; third, detection, association, and re-identification are jointly modeled in a unified architecture. Our method achieves state-of-the-art IDF1 scores of 96.1% on WildTrack and 85.7% on MultiviewX. Key contributions include: (i) a cross-frame cross-attention mechanism for joint temporal and inter-view feature alignment; and (ii) a distortion-robust early-fusion tracking paradigm.
📝 Abstract
Despite recent advancements in multi-object tracking, occlusion remains a significant challenge. Multi-camera setups have been used to address this challenge by providing comprehensive coverage of the scene. Recent multi-view pedestrian detection models have highlighted the potential of an early-fusion strategy: projecting the feature maps of all views onto a common ground plane, or Bird's Eye View (BEV), and then performing detection. This strategy has been shown to improve both detection and tracking performance. However, the perspective transformation introduces significant distortion on the ground plane, degrading the robustness of the pedestrians' appearance features. To tackle this limitation, we propose a novel model that incorporates attention mechanisms in a multi-view pedestrian tracking scenario. Our model uses an early-fusion strategy for detection and a cross-attention mechanism to establish robust associations between pedestrians in different frames, while efficiently propagating pedestrian features across frames, yielding a more robust feature representation for each pedestrian. Extensive experiments demonstrate that our model outperforms state-of-the-art models, with an IDF1 score of 96.1% on the WildTrack dataset and 85.7% on the MultiviewX dataset.
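The core association step described above can be illustrated with a minimal sketch of cross-frame attention: pedestrian features from the current frame act as queries, and features from the previous frame act as keys and values, so that each current detection aggregates appearance information from its most similar past detections. This is a simplified, numpy-only illustration of scaled dot-product cross-attention; the actual model presumably uses learned query/key/value projections inside the early-fusion pipeline, and all array names and dimensions here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_frame_attention(curr_feats, prev_feats):
    """Current-frame pedestrian features (queries) attend to
    previous-frame features (keys/values).

    curr_feats: (N_curr, D) features of detections in frame t
    prev_feats: (N_prev, D) features of detections in frame t-1
    Returns the propagated features and the attention matrix,
    whose rows can be read as soft association scores.
    """
    d_k = curr_feats.shape[-1]
    scores = curr_feats @ prev_feats.T / np.sqrt(d_k)   # (N_curr, N_prev)
    weights = softmax(scores, axis=-1)                  # soft associations
    propagated = weights @ prev_feats                   # (N_curr, D)
    return propagated, weights

# toy example: 3 pedestrians in the current frame, 4 in the previous frame
rng = np.random.default_rng(0)
curr = rng.standard_normal((3, 8))
prev = rng.standard_normal((4, 8))
propagated, attn = cross_frame_attention(curr, prev)
```

Reading each row of `attn` as a soft assignment over previous-frame detections is what lets association and feature propagation share one mechanism: the same weights that match identities also mix past appearance features into the current representation.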