🤖 AI Summary
Existing event-based object tracking methods overlook the spatial sparsity and dynamically varying temporal density of event data, and their reliance on a single fixed-time-window sampling strategy limits adaptability under varying motion dynamics. This work proposes a sparsity-aware hierarchical tracking framework that progressively injects sparse, medium-density, and dense event search regions into a three-stage Vision Transformer backbone. The approach adds a sparsity-aware Mixture-of-Experts (MoE) module, which encourages expert specialization under different sparsity patterns, and a dynamic pondering mechanism that adjusts inference depth according to tracking difficulty. To our knowledge, this is the first method in event-based tracking to jointly incorporate multi-scale event density modeling, sparsity-aware MoE, and dynamic computation path selection based on input sparsity and task difficulty. Experiments on the FE240hz, COESOT, and EventVOT benchmarks show a favorable trade-off between tracking accuracy and computational efficiency.
📝 Abstract
Despite significant progress, RGB-based trackers remain vulnerable to challenging imaging conditions, such as low illumination and fast motion. Event cameras offer a promising alternative by asynchronously capturing pixel-wise brightness changes, providing high dynamic range and high temporal resolution. However, existing event-based trackers often neglect the intrinsic spatial sparsity and temporal density of event data, while relying on a single fixed temporal-window sampling strategy that is suboptimal under varying motion dynamics. In this paper, we propose an event sparsity-aware tracking framework that explicitly models event-density variations across multiple temporal scales. Specifically, the proposed framework progressively injects sparse, medium-density, and dense event search regions into a three-stage Vision Transformer backbone, enabling hierarchical multi-density feature learning. Furthermore, we introduce a sparsity-aware Mixture-of-Experts module to encourage expert specialization under different sparsity patterns, and design a dynamic pondering strategy to adaptively adjust the inference depth according to tracking difficulty. Extensive experiments on FE240hz, COESOT, and EventVOT demonstrate that the proposed approach achieves a favorable trade-off between tracking accuracy and computational efficiency. The source code will be released at https://github.com/Event-AHU/OpenEvTracking.
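To make the idea of sampling the same event stream at several temporal scales concrete, the following is a minimal sketch, not the authors' implementation. It assumes events are `(timestamp_us, x, y, polarity)` tuples and that a short, a medium, and a long window ending at the same instant yield the sparse, medium-density, and dense views. The function names, the window lengths, and the use of the empty-pixel fraction as a sparsity measure are all hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple

# Assumed event layout: (timestamp in microseconds, x, y, polarity).
Event = Tuple[int, int, int, int]


def accumulate(events: Sequence[Event], t_end: int, window: int,
               width: int, height: int) -> List[List[int]]:
    """Count events per pixel inside the window (t_end - window, t_end]."""
    frame = [[0] * width for _ in range(height)]
    for t, x, y, _polarity in events:
        in_window = t_end - window < t <= t_end
        in_bounds = 0 <= x < width and 0 <= y < height
        if in_window and in_bounds:
            frame[y][x] += 1
    return frame


def sparsity(frame: List[List[int]]) -> float:
    """Fraction of pixels that received no event (1.0 means fully empty)."""
    total = sum(len(row) for row in frame)
    empty = sum(1 for row in frame for count in row if count == 0)
    return empty / total


def multi_density_frames(events: Sequence[Event], t_end: int,
                         windows: Sequence[int], width: int,
                         height: int) -> List[List[List[int]]]:
    """One frame per window, ordered from shortest (sparsest) to longest."""
    return [accumulate(events, t_end, w, width, height)
            for w in sorted(windows)]


# Toy stream: one event per millisecond, sweeping a 4x4 sensor row by row.
events = [(1000 * i, i % 4, (i // 4) % 4, 1) for i in range(16)]
# Hypothetical window lengths in microseconds (not taken from the paper).
frames = multi_density_frames(events, t_end=15000,
                              windows=[2000, 6000, 20000],
                              width=4, height=4)
print([sparsity(f) for f in frames])  # [0.875, 0.625, 0.0]
```

In this toy case the shortest window captures 2 events, the middle one 6, and the longest all 16, so the measured sparsity drops from 0.875 to 0.0. A tracker in the spirit of the abstract would feed such progressively denser views to successive backbone stages, but how the actual framework builds and injects its search regions is not specified here.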