Decoupling Amplitude and Phase Attention in Frequency Domain for RGB-Event based Visual Object Tracking

📅 2026-01-03
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes a frequency-domain early fusion framework for RGB-Event visual object tracking, addressing the limitations of existing feature-level fusion approaches, which fail to fully exploit the high dynamic range and motion sensitivity of event cameras and incur redundant computation in low-information regions. By applying the Fast Fourier Transform, the method decouples RGB and event signals into amplitude and phase components and introduces an amplitude-phase attention mechanism to selectively integrate high-frequency event information. A motion-guided spatial sparsification module additionally retains only target-relevant features for the backbone network. The approach achieves, for the first time, modality disentanglement and attentive fusion in the frequency domain, substantially enhancing representation capability while reducing computational overhead, and attains state-of-the-art performance on the FE108, FELT, and COESOT benchmarks.
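
To make the decoupling and attentive fusion concrete, here is a minimal PyTorch sketch. It is not the authors' released code: the helper names, the ortho-normalized torch.fft calls, and the sigmoid-gated 1x1 convolutions standing in for the amplitude-phase attention are all assumptions.

```python
import torch
import torch.fft
import torch.nn as nn

def decouple(feat: torch.Tensor):
    """Split a (B, C, H, W) feature map into amplitude and phase spectra."""
    spec = torch.fft.fft2(feat, norm="ortho")  # complex frequency spectrum
    return torch.abs(spec), torch.angle(spec)

def recompose(amplitude: torch.Tensor, phase: torch.Tensor) -> torch.Tensor:
    """Invert decouple(): amplitude * exp(i * phase), back to the spatial domain."""
    return torch.fft.ifft2(torch.polar(amplitude, phase), norm="ortho").real

class AmplitudePhaseFusion(nn.Module):
    """Gated injection of event amplitude/phase into the RGB spectrum.

    The sigmoid-gated 1x1 convolutions are a hypothetical stand-in for the
    paper's amplitude-phase attention, not its actual layer design.
    """
    def __init__(self, channels: int):
        super().__init__()
        self.amp_gate = nn.Sequential(nn.Conv2d(2 * channels, channels, 1), nn.Sigmoid())
        self.pha_gate = nn.Sequential(nn.Conv2d(2 * channels, channels, 1), nn.Sigmoid())

    def forward(self, rgb: torch.Tensor, event: torch.Tensor) -> torch.Tensor:
        amp_r, pha_r = decouple(rgb)
        amp_e, pha_e = decouple(event)
        # Per-channel, per-frequency gates decide how much event content to admit.
        g_amp = self.amp_gate(torch.cat([amp_r, amp_e], dim=1))
        g_pha = self.pha_gate(torch.cat([pha_r, pha_e], dim=1))
        return recompose(amp_r + g_amp * amp_e, pha_r + g_pha * pha_e)
```

For example, AmplitudePhaseFusion(64) fuses two (B, 64, H, W) feature maps and returns a real-valued map of the same shape, so it can slot in before the backbone as an early-fusion step.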

📝 Abstract
Existing RGB-Event visual object tracking approaches primarily rely on conventional feature-level fusion, failing to fully exploit the unique advantages of event cameras. In particular, the high dynamic range and motion-sensitive nature of event cameras are often overlooked, while low-information regions are processed uniformly, leading to unnecessary computational overhead for the backbone network. To address these issues, we propose a novel tracking framework that performs early fusion in the frequency domain, enabling effective aggregation of high-frequency information from the event modality. Specifically, the RGB and event modalities are transformed from the spatial domain to the frequency domain via the Fast Fourier Transform, with their amplitude and phase components decoupled. High-frequency event information is selectively fused into the RGB modality through amplitude and phase attention, enhancing feature representation while substantially reducing backbone computation. In addition, a motion-guided spatial sparsification module leverages the motion-sensitive nature of event cameras to capture the relationship between target motion cues and the spatial probability distribution, filtering out low-information regions and enhancing target-relevant features. Finally, a sparse set of target-relevant features is fed into the backbone network for learning, and the tracking head predicts the final target position. Extensive experiments on three widely used RGB-Event tracking benchmark datasets, FE108, FELT, and COESOT, demonstrate the high performance and efficiency of our method. The source code will be released at https://github.com/Event-AHU/OpenEvTracking
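
The motion-guided spatial sparsification can be sketched the same way: score each spatial token by event activity and keep only the most active fraction for the backbone. In the hedged sketch below, the scoring rule (mean absolute event activation as a proxy for the motion-derived spatial probability distribution) and the keep ratio are assumptions, not the paper's actual design.

```python
import torch

def sparsify_tokens(tokens: torch.Tensor, event_tokens: torch.Tensor,
                    keep_ratio: float = 0.5):
    """Keep only the most motion-active tokens.

    tokens, event_tokens: (B, N, C). Returns the kept (B, k, C) tokens and
    their indices, so predictions can be scattered back to the full grid.
    """
    scores = event_tokens.abs().mean(dim=-1)  # (B, N) motion-energy proxy
    k = max(1, int(tokens.shape[1] * keep_ratio))
    idx = scores.topk(k, dim=1).indices       # indices of the k liveliest tokens
    kept = tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1]))
    return kept, idx
```
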
Problem

Research questions and friction points this paper is trying to address.

RGB-Event tracking
feature fusion
event cameras
computational overhead
high dynamic range
Innovation

Methods, ideas, or system contributions that make the work stand out.

frequency domain fusion
amplitude-phase decoupling
event camera
motion-guided sparsification
visual object tracking

👥 Authors

Shiao Wang
Anhui University
Deep Learning

Xiao Wang
School of Computer Science and Technology, Anhui University, Hefei 230601, China

Haonan Zhao
Northeastern University, Shenyang, China

Jiarui Xu
University of Sydney
MLOps

Bo Jiang
Anhui University
Computer Vision and Pattern Recognition

Lin Zhu
Assistant Professor, School of Computer Science & Technology, Beijing Institute of Technology
Neuromorphic vision · Video processing · Event-based vision · Spiking neural network

Xin Zhao
Professor, University of Science and Technology Beijing (USTB)
Computer Vision · Data-Centric AI · AI4Science

Yonghong Tian
Peng Cheng Laboratory, Shenzhen, China; National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University, China; School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University, China

Jin Tang
Anhui University
Computer vision · Intelligent video analysis