Learning Progressive Adaptation for Multi-Modal Tracking

📅 2026-03-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of transferring pretrained RGB models to multi-modal tracking under limited paired data, where existing methods struggle due to insufficient adaptation of modality-specific characteristics, cross-modal interactions, and the task-specific prediction head. To overcome this, the authors propose PATrack, a progressive adaptation framework that introduces intra-modal high/low-frequency decomposition and a cross-modal attention mechanism guided by inter-modal shared information. PATrack further incorporates a three-tier parameter-efficient adapter architecture, comprising modality-dependent, modality-entangled, and task-level adapters, to enable multi-level collaborative optimization. Extensive experiments on RGB+Thermal, RGB+Depth, and RGB+Event tracking benchmarks demonstrate that PATrack consistently outperforms state-of-the-art approaches, validating its effectiveness and generalization capability.

📝 Abstract
Due to the limited availability of paired multi-modal data, multi-modal trackers are typically built by adapting pre-trained RGB models with parameter-efficient fine-tuning modules. However, existing fine-tuning methods overlook advanced strategies for applying RGB pre-trained models and fail to jointly modulate each individual modality, the cross-modal interactions, and the prediction head. To address these issues, we propose Progressive Adaptation for Multi-Modal Tracking (PATrack). This approach incorporates modality-dependent, modality-entangled, and task-level adapters, bridging the gap in adapting RGB pre-trained networks to multi-modal data through a progressive strategy. Specifically, modality-specific information is enhanced by the modality-dependent adapter, which decomposes features into high- and low-frequency components to ensure a more robust representation within each modality. Inter-modal interactions are introduced by the modality-entangled adapter, which performs a cross-attention operation guided by inter-modal shared information, ensuring the reliability of features conveyed between modalities. Additionally, recognising that the strong inductive bias of the prediction head does not adapt to the fused information, a task-level adapter specific to the prediction head is introduced. In summary, our design integrates intra-modal, inter-modal, and task-level adapters into a unified framework. Extensive experiments on RGB+Thermal, RGB+Depth, and RGB+Event tracking tasks demonstrate that our method achieves impressive performance compared with state-of-the-art methods. Code is available at https://github.com/ouha1998/Learning-Progressive-Adaptation-for-Multi-Modal-Tracking.
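The first two adapter tiers in the abstract admit a rough sketch. The NumPy toy below only illustrates the general shape of the ideas; the moving-average low-pass filter, the single-head cross-attention, and the scalar `gate` standing in for the shared-information guidance are all assumptions for illustration, not the paper's actual design.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def low_high_decompose(x, k=3):
    """Split token features into a low-frequency part (moving average along
    the token axis) and a high-frequency residual. x: (tokens, dim)."""
    kernel = np.ones(k) / k
    low = np.apply_along_axis(
        lambda col: np.convolve(col, kernel, mode="same"), 0, x)
    return low, x - low

def modality_dependent_adapter(x, w_low, w_high):
    """Intra-modal adapter: project the two frequency bands with separate
    lightweight matrices and add them back residually."""
    low, high = low_high_decompose(x)
    return x + low @ w_low + high @ w_high

def modality_entangled_adapter(q_feat, kv_feat, gate):
    """Inter-modal adapter: cross-attention from one modality's tokens to the
    other's, scaled by a gate that stands in for the shared-information
    guidance described in the abstract."""
    d = q_feat.shape[-1]
    attn = softmax(q_feat @ kv_feat.T / np.sqrt(d))
    return q_feat + gate * (attn @ kv_feat)
```

In a real tracker these adapters would be small trainable modules inserted into a frozen RGB backbone; here the projections and gate are plain arrays to keep the sketch self-contained.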
Problem

Research questions and friction points this paper is trying to address.

multi-modal tracking
RGB pre-trained models
parameter-efficient fine-tuning
cross-modal interactions
prediction head adaptation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Progressive Adaptation
Modality-Dependent Adapter
Modality-Entangled Adapter
Task-Level Adapter
Multi-Modal Tracking