🤖 AI Summary
This work addresses the challenging problem of tracking any point (TAP) under difficult illumination conditions and high-speed motion by introducing event cameras to TAP for the first time and proposing the first purely event-driven, end-to-end tracking framework. Methodologically, it (1) designs a data modeling scheme and a global contextual feature learning mechanism tailored to asynchronous, sparse event streams; (2) introduces a motion-robust feature-alignment loss to mitigate the event feature distortion caused by high-speed motion; and (3) establishes a novel event data generation pipeline. Experiments on an event-based tracking benchmark show that the method achieves a 136% average Jaccard improvement over baseline methods, outperforms the previous best purely event-based approach by 20%, and surpasses the strongest events-and-frames fusion method by 4.1%. The framework significantly improves tracking robustness and accuracy under extreme lighting changes and in high-speed scenarios.
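The summary does not spell out the paper's data modeling scheme, but a common way to feed an asynchronous, sparse event stream into a dense network is a voxel grid with bilinear interpolation along time. The sketch below is a generic illustration of that idea, not necessarily the representation used in this work; all names and shapes are assumptions.

```python
import numpy as np

def events_to_voxel_grid(x, y, t, p, num_bins, height, width):
    """Accumulate events (x, y, t, polarity) into a (num_bins, H, W)
    voxel grid, splitting each event's polarity between its two
    nearest temporal bins (bilinear interpolation in time).

    Illustrative only; the paper's actual event representation may differ.
    """
    grid = np.zeros((num_bins, height, width), dtype=np.float32)
    if len(t) == 0:
        return grid
    # Normalize timestamps to the continuous bin axis [0, num_bins - 1].
    t = (t - t[0]) / max(t[-1] - t[0], 1e-9) * (num_bins - 1)
    pol = 2.0 * p.astype(np.float32) - 1.0  # map polarity {0, 1} -> {-1, +1}
    left = np.floor(t).astype(int)
    right = np.clip(left + 1, 0, num_bins - 1)
    w_right = t - left
    # Unbuffered scatter-add so repeated pixel indices accumulate correctly.
    np.add.at(grid, (left, y, x), pol * (1.0 - w_right))
    np.add.at(grid, (right, y, x), pol * w_right)
    return grid
```

The resulting dense tensor can then be consumed by standard convolutional or transformer feature extractors, which is how TAP-style global context models typically ingest event data.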
📝 Abstract
Tracking any point (TAP) recently shifted the motion estimation paradigm from focusing on individual salient points with local templates to tracking arbitrary points with global image contexts. However, while research has mostly focused on driving the accuracy of models in nominal settings, scenarios with difficult lighting conditions and high-speed motion remain out of reach due to the limitations of conventional frame-based sensors. This work addresses this challenge with the first event camera-based TAP method. It leverages the high temporal resolution and high dynamic range of event cameras for robust high-speed tracking, and the global contexts in TAP methods to handle asynchronous and sparse event measurements. We further extend the TAP framework to handle event feature variations induced by motion, an open challenge in purely event-based tracking, with a novel feature-alignment loss that ensures the learning of motion-robust features. Our method is trained with data from a new data generation pipeline and systematically ablated across all design decisions. It shows strong cross-dataset generalization and performs 136% better on the average Jaccard metric than the baselines. Moreover, on an established feature tracking benchmark, it achieves a 20% improvement over the previous best event-only method and even surpasses the previous best events-and-frames method by 4.1%. Our code is available at https://github.com/tub-rip/ETAP.
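The abstract names a feature-alignment loss but does not give its exact form. One plausible instantiation, shown purely as a sketch, is to penalize the dissimilarity between features of the same tracked point extracted under different motion states; here we use cosine dissimilarity over paired feature vectors. Function name, shapes, and the choice of similarity are assumptions, not the paper's definition.

```python
import numpy as np

def feature_alignment_loss(feats_a, feats_b, eps=1e-8):
    """Illustrative motion-robust feature-alignment loss.

    feats_a, feats_b: (N, C) arrays holding features of the same N
    tracked points, extracted from event representations at two
    different motion states. Returns the mean cosine dissimilarity:
    0 when paired features are identical up to scale, 2 when they
    point in opposite directions.
    """
    a = feats_a / (np.linalg.norm(feats_a, axis=-1, keepdims=True) + eps)
    b = feats_b / (np.linalg.norm(feats_b, axis=-1, keepdims=True) + eps)
    cos = np.sum(a * b, axis=-1)          # per-pair cosine similarity
    return float(np.mean(1.0 - cos))      # 0 = perfectly aligned
```

Minimizing such a term during training encourages the feature extractor to produce descriptors that stay stable as the event signature of a point changes with its speed, which is the stated goal of the alignment loss.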