X Modality Assisting RGBT Object Tracking

📅 2023-12-27

🏛️ arXiv.org

📈 Citations: 2

✨ Influential: 0

🤖 AI Summary

This paper addresses three challenges in RGB-T (RGB-thermal) object tracking: large discrepancies between the two modalities, weak cross-modal feature coupling, and susceptibility to drift. It proposes X-Net, a framework that decouples fusion into three levels: pixel-level generation, feature-level interaction, and decision-level refinement. Its core contributions are: (1) an X-modality auxiliary generation mechanism, in which a plug-and-play Pixel-level Generation Module (PGM) based on self-knowledge distillation bridges the modality gap; (2) a Feature-level Interaction Module (FIM) that combines a mixed feature interaction Transformer with a spatial-dimensional feature translation strategy to strengthen cross-modal interaction; and (3) a Decision-level Refinement Module (DRM), an online strategy that uses optical flow and refinement mechanisms to correct drift. Evaluated on three RGB-T benchmarks, X-Net outperforms state-of-the-art trackers, improving localization accuracy and tracking robustness, especially in complex scenarios involving occlusion, illumination changes, and thermal ambiguity.

📝 Abstract

Learning robust multi-modal feature representations is critical for boosting tracking performance. To this end, we propose a novel X Modality Assisting Network (X-Net) to shed light on the impact of the fusion paradigm by decoupling the visual object tracking into three distinct levels, facilitating subsequent processing. Firstly, to tackle the feature learning hurdles stemming from significant differences between RGB and thermal modalities, a plug-and-play pixel-level generation module (PGM) is proposed based on self-knowledge distillation learning, which effectively generates X modality to bridge the gap between the dual patterns while reducing noise interference. Subsequently, to further achieve the optimal sample feature representation and facilitate cross-modal interactions, we propose a feature-level interaction module (FIM) that incorporates a mixed feature interaction transformer and a spatial-dimensional feature translation strategy. Ultimately, aiming at random drifting due to missing instance features, we propose a flexible online optimized strategy called the decision-level refinement module (DRM), which contains optical flow and refinement mechanisms. Experiments are conducted on three benchmarks to verify that the proposed X-Net outperforms state-of-the-art trackers.
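The three-level decomposition described in the abstract can be pictured with a toy sketch. This is not the authors' implementation: the function names `generate_x`, `fuse_features`, and `refine_box`, the per-pixel blend, the element-wise averaging, and the median-flow box shift are all hypothetical stand-ins chosen only to show where a pixel-level, a feature-level, and a decision-level step would sit in a pipeline. The actual PGM, FIM, and DRM are learned modules whose details are not given here.

```python
from statistics import median

def generate_x(rgb_gray, thermal, alpha=0.5):
    """Pixel-level stand-in (assumption): blend two same-sized grayscale images."""
    return [[alpha * r + (1 - alpha) * t for r, t in zip(row_r, row_t)]
            for row_r, row_t in zip(rgb_gray, thermal)]

def fuse_features(feat_rgb, feat_t, feat_x):
    """Feature-level stand-in (assumption): element-wise mean of three vectors."""
    return [(a + b + c) / 3 for a, b, c in zip(feat_rgb, feat_t, feat_x)]

def refine_box(box, flow):
    """Decision-level stand-in (assumption): shift an (x, y, w, h) box by the
    median optical-flow vector of the pixels it covers.
    flow[row][col] is a (dx, dy) tuple."""
    x, y, w, h = box
    dxs = [flow[r][c][0] for r in range(y, y + h) for c in range(x, x + w)]
    dys = [flow[r][c][1] for r in range(y, y + h) for c in range(x, x + w)]
    return (x + round(median(dxs)), y + round(median(dys)), w, h)

# Toy inputs: 2x2 images, 2-element features, 4x4 flow moving right by 1 pixel.
rgb = [[0.0, 1.0], [1.0, 0.0]]
thermal = [[1.0, 1.0], [0.0, 0.0]]
x_modality = generate_x(rgb, thermal)
fused = fuse_features([1.0, 2.0], [3.0, 4.0], [5.0, 6.0])
flow = [[(1, 0) for _ in range(4)] for _ in range(4)]
new_box = refine_box((1, 1, 2, 2), flow)
```

The only point of the sketch is the separation of concerns: each level can be swapped out independently, which matches the abstract's description of the PGM as plug-and-play and of the DRM as a flexible online strategy.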

Problem

Research questions and friction points this paper is trying to address.

Enhance object tracking with multi-modal features.

Bridge RGB and thermal modality discrepancies.

Optimize feature representation and prevent tracking drift.

Innovation

Methods, ideas, or system contributions that make the work stand out.

X Modality Assisting Network (X-Net)

Plug-and-play pixel-level generation module (PGM)

Feature-level interaction module (FIM)

Decision-level refinement module (DRM)
