🤖 AI Summary
This work proposes a frequency-domain early fusion framework for RGB-event visual object tracking, addressing the limitations of existing feature-level fusion approaches that fail to fully exploit the high dynamic range and motion sensitivity of event cameras while incurring redundant computation in low-information regions. By applying the fast Fourier transform, the method decouples RGB and event signals into magnitude and phase components, and introduces a magnitude-phase attention mechanism to selectively integrate high-frequency event information. Furthermore, a motion-guided spatial sparsification module is incorporated to retain only target-relevant features for the backbone network. This approach achieves, for the first time, modality disentanglement and attentive fusion in the frequency domain, significantly enhancing representation capability while reducing computational overhead. It attains state-of-the-art performance on the FE108, FELT, and COESOT benchmarks.
📝 Abstract
Existing RGB-Event visual object tracking approaches primarily rely on conventional feature-level fusion, failing to fully exploit the unique advantages of event cameras. In particular, the high dynamic range and motion-sensitive nature of event cameras are often overlooked, while low-information regions are processed uniformly, leading to unnecessary computational overhead for the backbone network. To address these issues, we propose a novel tracking framework that performs early fusion in the frequency domain, enabling effective aggregation of high-frequency information from the event modality. Specifically, the RGB and event modalities are transformed from the spatial domain to the frequency domain via the Fast Fourier Transform, with their amplitude and phase components decoupled. High-frequency event information is selectively fused into the RGB modality through amplitude and phase attention, enhancing feature representation while substantially reducing backbone computation. In addition, a motion-guided spatial sparsification module leverages the motion-sensitive nature of event cameras to capture the relationship between target motion cues and the spatial probability distribution, filtering out low-information regions and enhancing target-relevant features. Finally, a sparse set of target-relevant features is fed into the backbone network for learning, and the tracking head predicts the final target position. Extensive experiments on three widely used RGB-Event tracking benchmark datasets, including FE108, FELT, and COESOT, demonstrate the high performance and efficiency of our method. The source code of this paper will be released at https://github.com/Event-AHU/OpenEvTracking
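The amplitude/phase decoupling described above can be sketched in a few lines with NumPy's FFT. This is only an illustration of the transform-and-recombine idea: the paper's learned amplitude/phase attention is replaced here by a fixed blend weight `alpha`, which is an assumption for demonstration, not the actual fusion mechanism.

```python
import numpy as np

def decouple(x):
    # 2-D FFT of a spatial feature map, split into amplitude and phase.
    f = np.fft.fft2(x)
    return np.abs(f), np.angle(f)

def recouple(amplitude, phase):
    # Recombine amplitude and phase, then return to the spatial domain.
    return np.real(np.fft.ifft2(amplitude * np.exp(1j * phase)))

def frequency_fuse(rgb, event, alpha=0.3):
    # Illustrative early fusion in the frequency domain: blend the event
    # amplitude spectrum into the RGB amplitude spectrum while keeping the
    # RGB phase. The fixed scalar `alpha` stands in for the paper's
    # amplitude/phase attention (hypothetical simplification).
    a_rgb, p_rgb = decouple(rgb)
    a_evt, _ = decouple(event)
    fused_amplitude = (1.0 - alpha) * a_rgb + alpha * a_evt
    return recouple(fused_amplitude, p_rgb)
```

Because the FFT is invertible, `recouple(*decouple(x))` reconstructs `x` exactly (up to floating-point error), so the decoupling itself loses no information; what the method learns is how to reweight the two modalities' spectra before inverting.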