CADTrack: Learning Contextual Aggregation with Deformable Alignment for Robust RGBT Tracking

📅 2025-11-22
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing RGB-thermal (RGBT) trackers struggle with discrepancies between the visible and thermal modalities, which weakens feature representation and hinders cross-modal fusion, reducing tracking accuracy. To address this, we propose CADTrack, an RGBT tracking framework with three components: (1) Mamba-based Feature Interaction (MFI), which uses state space models for cross-modal feature interaction with linear complexity; (2) a Contextual Aggregation Module (CAM), which uses Mixture-of-Experts (MoE) sparse gating to dynamically activate backbone layers and aggregate context from cross-layer features; and (3) a Deformable Alignment Module (DAM), which combines deformable sampling with temporal propagation to mitigate spatial misalignment and localization drift. Experiments on five RGBT tracking benchmarks verify the effectiveness of the method. The source code is publicly available.
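
The linear-complexity claim follows from how a state space model processes a sequence: one recurrent update per token. The sketch below is a minimal illustration, not the CADTrack implementation. It uses a scalar state with fixed parameters `a`, `b`, `c` (Mamba makes these input-dependent), and the interleaving of RGB and thermal tokens is a hypothetical choice made only to show the state carrying information across modalities.

```python
def ssm_scan(xs, a, b, c):
    """Run a scalar discrete state space model over a sequence in one pass.

    h_t = a * h_{t-1} + b * x_t
    y_t = c * h_t
    One update per token, so the cost grows linearly with sequence length.
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys


# Hypothetical toy tokens; interleaving the modalities is an assumption.
rgb = [1.0, 2.0]
tir = [0.5, 0.25]
tokens = [t for pair in zip(rgb, tir) for t in pair]  # [1.0, 0.5, 2.0, 0.25]

print(ssm_scan(tokens, a=0.5, b=1.0, c=1.0))  # [1.0, 1.0, 2.5, 1.5]
```

Each output depends on all earlier tokens through the single state `h`, which is what lets the two modalities interact without pairwise attention between every token pair.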

📝 Abstract
RGB-Thermal (RGBT) tracking aims to exploit visible and thermal infrared modalities for robust all-weather object tracking. However, existing RGBT trackers struggle to resolve modality discrepancies, which makes robust feature representation difficult. This limitation hinders effective cross-modal information propagation and fusion, which significantly reduces tracking accuracy. To address it, we propose a novel Contextual Aggregation with Deformable Alignment framework, called CADTrack, for RGBT tracking. Specifically, we first deploy Mamba-based Feature Interaction (MFI), which establishes efficient feature interaction via state space models. This interaction module operates with linear complexity, reducing computational cost and improving feature discrimination. Then, we propose the Contextual Aggregation Module (CAM), which dynamically activates backbone layers through sparse gating based on the Mixture-of-Experts (MoE). This module encodes complementary contextual information from cross-layer features. Finally, we propose the Deformable Alignment Module (DAM), which integrates deformable sampling and temporal propagation to mitigate spatial misalignment and localization drift. With these components, CADTrack achieves robust and accurate tracking in complex scenarios. Extensive experiments on five RGBT tracking benchmarks verify the effectiveness of the proposed method. The source code is released at https://github.com/IdolLab/CADTrack.
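
The sparse gating described for CAM can be pictured with a generic top-k Mixture-of-Experts gate. The sketch below is an assumption-laden illustration rather than the authors' module: the gate scores are given by hand instead of being predicted by a learned router, each "expert" is simply one backbone layer's feature vector, and the names `sparse_gate` and `aggregate` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_gate(scores, k):
    """Keep the k highest-scoring entries, renormalise them, zero the rest."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    kept = softmax([scores[i] for i in top])
    weights = [0.0] * len(scores)
    for i, w in zip(top, kept):
        weights[i] = w
    return weights


def aggregate(layer_feats, scores, k=2):
    """Weighted sum of the active layers' features; inactive layers are skipped."""
    weights = sparse_gate(scores, k)
    out = [0.0] * len(layer_feats[0])
    for w, feat in zip(weights, layer_feats):
        if w == 0.0:
            continue  # this layer is not activated for the current input
        for j, value in enumerate(feat):
            out[j] += w * value
    return out, weights


# Hypothetical features from four backbone layers and hand-picked gate scores.
layer_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]]
scores = [0.0, 3.0, 3.0, -1.0]

fused, weights = aggregate(layer_feats, scores, k=2)
print(weights)  # [0.0, 0.5, 0.5, 0.0]
print(fused)    # [0.5, 1.0]
```

Only two of the four layers contribute to the fused feature, which is the sense in which the gating is sparse.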

Problem

Research questions and friction points this paper is trying to address.

Resolving modality discrepancies between visible and thermal infrared data
Improving cross-modal feature propagation and fusion efficiency
Mitigating spatial misalignment and localization drift in tracking

Innovation

Methods, ideas, or system contributions that make the work stand out.

Mamba-based Feature Interaction with linear complexity
Contextual Aggregation Module using Mixture-of-Experts
Deformable Alignment Module for spatial-temporal correction
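
As a rough picture of the deformable sampling behind the third item, the sketch below resamples a small thermal feature map at per-position offsets using bilinear interpolation. It is a minimal illustration under stated assumptions, not the DAM itself: the offsets are supplied by hand rather than predicted by a network, the map is a single-channel 2D grid, temporal propagation is omitted, and `bilinear_sample` and `deformable_sample` are hypothetical helper names.

```python
import math


def bilinear_sample(fmap, y, x):
    """Bilinearly interpolate a 2D grid at fractional (y, x); outside the grid counts as 0."""
    h, w = len(fmap), len(fmap[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                weight = (1 - abs(y - yy)) * (1 - abs(x - xx))
                value += weight * fmap[yy][xx]
    return value


def deformable_sample(fmap, offsets):
    """Resample every position (i, j) at (i + dy, j + dx), with its own (dy, dx)."""
    h, w = len(fmap), len(fmap[0])
    return [
        [bilinear_sample(fmap, i + offsets[i][j][0], j + offsets[i][j][1])
         for j in range(w)]
        for i in range(h)
    ]


# Hypothetical 2x2 thermal feature map, shifted half a pixel to the right.
thermal = [[0.0, 1.0], [2.0, 3.0]]
offsets = [[(0.0, 0.5), (0.0, 0.5)], [(0.0, 0.5), (0.0, 0.5)]]

print(deformable_sample(thermal, offsets))  # [[0.5, 0.5], [2.5, 1.5]]
```

With all offsets set to zero the function returns the input unchanged, so the offsets alone control how far the thermal features are moved toward their RGB counterparts.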
Hao Li
College of Command and Control Engineering, Army Engineering University of PLA
Yuhao Wang
School of Future Technology, Dalian University of Technology
Xiantao Hu
Nanjing University of Science & Technology
Computer Vision
Wenning Hao
College of Command and Control Engineering, Army Engineering University of PLA
Pingping Zhang
School of Future Technology, Dalian University of Technology
Dong Wang
School of Information and Communication Engineering, Dalian University of Technology
Huchuan Lu
School of Future Technology, Dalian University of Technology; School of Information and Communication Engineering, Dalian University of Technology