AI Summary
This work addresses the challenge of efficiently modeling long-range dependencies in dense multi-object tracking, where conventional attention mechanisms suffer from high computational complexity and degraded performance under crowded or occluded conditions. To overcome this, the authors propose GateMOT, an online tracking framework featuring Q-Gated Attention, a novel mechanism that reformulates query vectors as learnable gating units to modulate key features element-wise. This design enables linear-complexity, explicit, and spatially aware correlation modeling. GateMOT further supports unified representation learning across detection, motion estimation, and re-identification tasks. Evaluated on the BEE24 benchmark, it achieves state-of-the-art performance with HOTA 48.4, MOTA 67.8, and IDF1 64.5, demonstrating strong generalization capability.
Abstract
While large models demonstrate the strong representational power of vanilla attention, this core mechanism cannot be directly applied to Dense Object Tracking: its quadratic all-to-all interactions are computationally prohibitive for dense motion estimation on high-resolution features. This mismatch prevents Dense Object Tracking from fully leveraging attention-based modeling in crowded and occlusion-heavy scenes. To address this challenge, we introduce GateMOT, an online tracking framework centered on Q-Gated Attention (Q-Attention), an efficient and spatially aware attention variant. Our key idea is to repurpose the Query from a similarity-conditioning term into a learnable gating unit. This Gating-Query (Gating-Q) produces a probabilistic gate that modulates Key features in an element-wise manner, enabling explicit relevance selection instead of costly global aggregation. Built on this mechanism, parallel Q-Attention heads transform one shared feature map into task-specific yet consistent representations for detection, motion, and re-identification, yielding a tightly coupled multi-task decoder with linear-complexity gating operations. GateMOT achieves state-of-the-art HOTA of 48.4, MOTA of 67.8, and IDF1 of 64.5 on BEE24, and demonstrates strong performance on additional Dense Object Tracking benchmarks. These results show that Q-Attention is a simple, effective, and transferable building block for attention-based tracking in dense scenes.
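To make the gating idea concrete, here is a minimal NumPy sketch of one way an element-wise query gate over key features could look. It is not the authors' implementation: the abstract does not give the exact formulation, so the sigmoid gate, the plain linear projections, the tensor shapes, and every name here (`q_gated_attention`, `w_q`, `w_k`, the task labels) are assumptions made for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def q_gated_attention(features, w_q, w_k):
    """Hypothetical Q-gated head (illustrative, not the paper's code).

    features: (N, C) feature map flattened over N = H * W positions.
    w_q, w_k: (C, D) projection matrices (assumed to be plain linear maps).
    """
    gate = sigmoid(features @ w_q)  # (N, D) gate with values in (0, 1)
    key = features @ w_k            # (N, D) key features
    # Element-wise modulation: no N x N similarity matrix is ever formed,
    # so the cost grows linearly with the number of positions N.
    return gate * key


rng = np.random.default_rng(0)
H, W, C, D = 8, 8, 16, 16
features = rng.standard_normal((H * W, C))  # one shared feature map

# Parallel heads read the same shared features with their own weights.
heads = {}
for task in ("detection", "motion", "reid"):
    w_q = rng.standard_normal((C, D)) * 0.1
    w_k = rng.standard_normal((C, D)) * 0.1
    heads[task] = q_gated_attention(features, w_q, w_k)

# Size of the intermediate that each approach must materialise.
n = H * W
vanilla_score_entries = n * n  # all-to-all attention scores
gated_entries = n * D          # element-wise gate
```

In this toy setting the all-to-all score matrix has 4096 entries while the gate has 1024, and the gap widens quadratically as the feature resolution grows, which is the mismatch the abstract describes for high-resolution dense tracking.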