TAPFormer: Robust Arbitrary Point Tracking via Transient Asynchronous Fusion of Frames and Events

📅 2026-03-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the performance degradation of existing arbitrary point tracking methods under complex illumination and motion conditions, where modality failure and temporal misalignment commonly occur. To overcome these challenges, the authors propose a Transformer-based asynchronous multimodal fusion framework that integrates RGB frames and event streams. The framework features a Transient Asynchronous Fusion (TAF) mechanism that models continuous event evolution between frames and a Cross-modal Locally Weighted Fusion (CLWF) module that adaptively modulates spatial attention according to modality reliability. Evaluated on a newly constructed real-world frame-event dataset, the method achieves a 28.2% improvement in average pixel error within threshold and consistently reaches state-of-the-art performance on standard point tracking benchmarks.
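
To make the TAF idea concrete, here is a minimal, hypothetical sketch of how a frame feature could be refined between two RGB frames by a sequence of high-rate event-slice updates. The module name, layer choices, and tensor layout are all illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the Transient Asynchronous Fusion (TAF) idea:
# propagate the last frame's feature map forward in time with a recurrent
# update driven by event slices, so the representation evolves continuously
# between the discrete RGB frames. Names are assumptions, not the paper's code.
import torch
import torch.nn as nn

class TransientAsyncFusion(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.event_proj = nn.Conv2d(2, dim, kernel_size=3, padding=1)  # 2 event polarities
        self.update = nn.GRUCell(dim, dim)  # recurrent state update per event slice

    def forward(self, frame_feat: torch.Tensor, event_slices: torch.Tensor) -> torch.Tensor:
        """frame_feat: (B, C, H, W) feature of the most recent RGB frame.
        event_slices: (B, T, 2, H, W) event voxel slices accumulated between frames."""
        b, c, h, w = frame_feat.shape
        # Treat each spatial location as an independent recurrent state.
        state = frame_feat.permute(0, 2, 3, 1).reshape(-1, c)
        for t in range(event_slices.shape[1]):
            ev = self.event_proj(event_slices[:, t])      # (B, C, H, W)
            ev = ev.permute(0, 2, 3, 1).reshape(-1, c)
            state = self.update(ev, state)                 # asynchronous refinement step
        return state.reshape(b, h, w, c).permute(0, 3, 1, 2)
```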

📝 Abstract
Tracking any point (TAP) is a fundamental yet challenging task in computer vision, requiring high precision and long-term motion reasoning. Recent attempts to combine RGB frames and event streams have shown promise, yet they typically rely on synchronous or non-adaptive fusion, leading to temporal misalignment and severe degradation when one modality fails. We introduce TAPFormer, a transformer-based framework that performs asynchronous, temporally consistent fusion of frames and events for robust and high-frequency arbitrary point tracking. Our key innovation is a Transient Asynchronous Fusion (TAF) mechanism, which explicitly models the temporal evolution between discrete frames through continuous event updates, bridging the gap between low-rate frames and high-rate events. In addition, a Cross-modal Locally Weighted Fusion (CLWF) module adaptively adjusts spatial attention according to modality reliability, yielding stable and discriminative features even under blur or low light. To evaluate our approach under realistic conditions, we construct a novel real-world frame-event TAP dataset captured under diverse illumination and motion conditions. Our method outperforms existing point trackers, achieving a 28.2% improvement in average pixel error within threshold. Moreover, on standard point tracking benchmarks, our tracker consistently achieves the best performance. Project website: tapformer.github.io
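
The CLWF module's reliability-based weighting can also be sketched in a few lines. The version below estimates a per-pixel reliability map for each modality and blends the two feature maps accordingly; the module name, the weight head, and the softmax blending are assumptions made for illustration, not the paper's architecture.

```python
# Minimal, hypothetical sketch of the Cross-modal Locally Weighted Fusion (CLWF)
# idea: predict local reliability weights for the frame and event branches and
# fuse their features with those weights. Layer choices are assumptions.
import torch
import torch.nn as nn

class CrossModalLocalWeightedFusion(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        # Predicts two reliability logits (frame vs. event) per spatial location
        # from the concatenated features of both modalities.
        self.weight_head = nn.Sequential(
            nn.Conv2d(2 * dim, dim, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim, 2, kernel_size=1),
        )

    def forward(self, frame_feat: torch.Tensor, event_feat: torch.Tensor) -> torch.Tensor:
        """frame_feat, event_feat: (B, C, H, W) features from the two branches."""
        logits = self.weight_head(torch.cat([frame_feat, event_feat], dim=1))  # (B, 2, H, W)
        w = logits.softmax(dim=1)  # local modality weights that sum to 1 per pixel
        # Down-weight the frame branch where it is unreliable (e.g. blur, low
        # light) and the event branch where events are sparse or noisy.
        return w[:, 0:1] * frame_feat + w[:, 1:2] * event_feat
```

A softmax over the two logits keeps the fusion a convex combination per pixel, which is one simple way to realize "adaptively adjusts spatial attention according to modality reliability" as described in the abstract.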
Problem

Research questions and friction points this paper is trying to address.

arbitrary point tracking
frame-event fusion
temporal misalignment
modality failure
robust tracking
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transient Asynchronous Fusion
Cross-modal Fusion
Event-based Vision
Arbitrary Point Tracking
Transformer