🤖 AI Summary
In multi-object tracking (MOT) for team sports, fast motion, frequent occlusions, and highly non-linear trajectories cause identity switches and localization errors. Existing approaches rely primarily on detection outputs and appearance matching, and they struggle when appearance cues are ambiguous and motion is non-linear. To address these challenges, we propose SportMamba, a hybrid Mamba-attention tracker with three components: (1) a Mamba-attention mechanism, built on a state-space model, that models non-linear motion by focusing on relevant embedding dependencies; (2) a height-adaptive spatial association metric that accounts for scale variation due to depth changes and reduces ID switches under partial occlusion; and (3) adaptive buffers that extend the detection search space to improve association under fast motion. SportMamba achieves state-of-the-art performance on SportsMOT. It also generalizes under zero-shot transfer to VIP-HTD, an ice hockey dataset.
📝 Abstract
Multi-object tracking (MOT) in team sports is particularly challenging because fast-paced motion causes motion blur and frequent occlusions cause identity switches. Predicting player positions in these scenarios is difficult because the observed motion patterns are highly non-linear. Current methods rely heavily on object detection and appearance-based tracking, which struggle in complex team sports scenarios where appearance cues are ambiguous and motion is not necessarily linear. To address these challenges, we introduce SportMamba, an adaptive hybrid MOT technique designed specifically for tracking in dynamic team sports. The technical contribution of SportMamba is twofold. First, we introduce a Mamba-attention mechanism that models non-linear motion by implicitly focusing on relevant embedding dependencies. Second, we propose a height-adaptive spatial association metric that accounts for scale variations due to depth changes, reducing ID switches caused by partial occlusions. Additionally, we extend the detection search space with adaptive buffers to improve associations in fast-motion scenarios. SportMamba demonstrates state-of-the-art performance across multiple metrics on the SportsMOT dataset, which is characterized by complex motion and severe occlusion. Furthermore, we demonstrate its generalization capability through zero-shot transfer to VIP-HTD, an ice hockey dataset.
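The abstract does not give the formula for the height-adaptive spatial association metric or for the adaptive buffers, so the following is only a minimal sketch of the general idea, under stated assumptions. It pads both boxes in proportion to their own size before computing IoU (a buffered IoU, so larger, closer players get a larger search region) and weights the result by the ratio of box heights (so a match between boxes at very different apparent depths is penalized). The names `Box`, `expand`, `height_similarity`, `association_score`, and the `buffer_ratio` parameter are hypothetical and are not taken from SportMamba.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned box in pixel coordinates (x1, y1) top-left, (x2, y2) bottom-right."""
    x1: float
    y1: float
    x2: float
    y2: float

    @property
    def width(self) -> float:
        return self.x2 - self.x1

    @property
    def height(self) -> float:
        return self.y2 - self.y1


def expand(box: Box, ratio: float) -> Box:
    """Pad a box on every side by `ratio` times its own width/height (assumed buffer rule)."""
    dx, dy = ratio * box.width, ratio * box.height
    return Box(box.x1 - dx, box.y1 - dy, box.x2 + dx, box.y2 + dy)


def iou(a: Box, b: Box) -> float:
    """Standard intersection-over-union of two boxes."""
    iw = max(0.0, min(a.x2, b.x2) - max(a.x1, b.x1))
    ih = max(0.0, min(a.y2, b.y2) - max(a.y1, b.y1))
    inter = iw * ih
    union = a.width * a.height + b.width * b.height - inter
    return inter / union if union > 0 else 0.0


def height_similarity(a: Box, b: Box) -> float:
    """Ratio of the smaller to the larger box height, in [0, 1]."""
    hi = max(a.height, b.height)
    return min(a.height, b.height) / hi if hi > 0 else 0.0


def association_score(track: Box, det: Box, buffer_ratio: float = 0.3) -> float:
    """Buffered IoU weighted by height similarity (illustrative, not the paper's metric)."""
    buffered = iou(expand(track, buffer_ratio), expand(det, buffer_ratio))
    return buffered * height_similarity(track, det)


if __name__ == "__main__":
    track = Box(0, 0, 10, 20)
    same_scale = Box(12, 0, 22, 20)   # moved right, no plain overlap with the track
    half_scale = Box(12, 0, 17, 10)   # same offset, half the height
    print(iou(track, same_scale))
    print(association_score(track, same_scale))
    print(association_score(track, half_scale))
```

In this sketch a fast-moving player whose detection no longer overlaps the previous box can still receive a non-zero score because of the buffer, while a candidate with a very different height scores lower. A real tracker would feed such scores into an assignment step (for example, Hungarian matching), which is omitted here.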