🤖 AI Summary
Existing RGB-D tracking methods typically rely on single-level bimodal feature fusion, which limits robustness and slows inference. To address these limitations, this paper proposes the Hierarchical Modality Aggregation and Distribution network (HMAD), the first framework to enable cross-level collaborative modeling of RGB and depth features. HMAD uses a dual-stream network to extract multi-level features, applies a cross-level attention mechanism for modality-adaptive weighted fusion, and introduces an adaptive depth feature calibration and distribution module that jointly accounts for modality heterogeneity and hierarchical complementarity. Evaluated on multiple standard RGB-D benchmarks, HMAD achieves state-of-the-art (SOTA) performance at real-time inference speeds exceeding 32 FPS, and it markedly improves tracking robustness, generalization, and interference resilience, particularly under occlusion, illumination variation, and sensor noise.
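To make the fusion idea concrete, below is a minimal PyTorch sketch of modality-adaptive weighted fusion over a multi-level feature pyramid. This is an illustration under our own assumptions, not the authors' implementation: `CrossLevelModalityFusion`, its gating branches, and the depth-calibration convolutions are hypothetical stand-ins, and the paper's cross-level attention is simplified here to per-level channel gating.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossLevelModalityFusion(nn.Module):
    """Illustrative sketch: fuse multi-level RGB and depth feature maps
    with modality-adaptive weights (not the authors' exact design)."""

    def __init__(self, channels: int, num_levels: int = 3):
        super().__init__()
        # One gating branch per pyramid level: predicts a scalar weight
        # for each modality from the pooled, concatenated features.
        self.gates = nn.ModuleList(
            nn.Sequential(
                nn.AdaptiveAvgPool2d(1),
                nn.Conv2d(2 * channels, 2, kernel_size=1),
            )
            for _ in range(num_levels)
        )
        # Lightweight calibration of the depth stream before fusion,
        # standing in for the paper's depth feature calibration module.
        self.depth_calib = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            for _ in range(num_levels)
        )

    def forward(self, rgb_feats, depth_feats):
        """rgb_feats / depth_feats: lists of (B, C, H_l, W_l) tensors,
        one per backbone level (shallow -> deep)."""
        fused = []
        for gate, calib, f_rgb, f_d in zip(
            self.gates, self.depth_calib, rgb_feats, depth_feats
        ):
            f_d = calib(f_d)  # calibrate the noisier depth modality
            w = gate(torch.cat([f_rgb, f_d], dim=1))  # (B, 2, 1, 1)
            w = F.softmax(w, dim=1)                   # modality weights sum to 1
            fused.append(w[:, 0:1] * f_rgb + w[:, 1:2] * f_d)
        return fused  # per-level fused features, redistributed to the head

# Toy usage with random multi-level features from a hypothetical dual-stream backbone.
if __name__ == "__main__":
    rgb = [torch.randn(1, 64, s, s) for s in (56, 28, 14)]
    depth = [torch.randn_like(f) for f in rgb]
    fusion = CrossLevelModalityFusion(channels=64, num_levels=3)
    print([f.shape for f in fusion(rgb, depth)])
```

The key design choice this sketch captures is that the fusion weights are predicted from the features themselves, so the network can lean on RGB when depth is noisy and on depth when appearance cues degrade (e.g., under illumination change).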
📝 Abstract
The integration of dual-modal features has been pivotal in advancing RGB-Depth (RGB-D) tracking. However, current trackers fuse only single-level features, which weakens fusion robustness, and their speeds fall short of the demands of real-world applications. In this paper, we introduce a novel network, denoted HMAD (Hierarchical Modality Aggregation and Distribution), to address these challenges. HMAD leverages the distinct feature representation strengths of the RGB and depth modalities and emphasizes hierarchical feature distribution and fusion, thereby enhancing the robustness of RGB-D tracking. Experimental results on several RGB-D datasets demonstrate that HMAD achieves state-of-the-art performance. Moreover, real-world experiments further validate HMAD’s capacity to handle a range of tracking challenges in real time.