Visual and Memory Dual Adapter for Multi-Modal Object Tracking

📅 2025-06-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing prompt-based multi-modal trackers suffer from inadequate modeling of critical cues in the frequency and temporal domains, resulting in non-robust prompt generation. To address this, the paper proposes the Visual and Memory Dual Adapter (VMDA). The visual adapter jointly models frequency-domain, spatial, and channel-wise features to adaptively fuse auxiliary-modality cues into the dominant modality; the memory adapter, inspired by the human memory mechanism, performs dynamic online update and retrieval of global temporal information. Together, these components strengthen cross-modal discriminative representation learning and preserve temporal consistency. Evaluated on RGB-Thermal, RGB-Depth, and RGB-Event tracking benchmarks, VMDA achieves state-of-the-art performance, with notable gains in accuracy and robustness under occlusion, illumination variation, and motion blur.
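
Below is a minimal sketch of how a visual adapter of this kind could combine frequency, spatial, and channel-wise cues to inject auxiliary-modality features into the RGB branch. All module names, layer choices, and the FFT-gating scheme are illustrative assumptions, not the paper's released implementation (see the linked repository for that):

```python
# Hedged sketch: one plausible visual-adapter design that jointly models
# frequency, spatial, and channel features. Not the authors' code.
import torch
import torch.nn as nn

class VisualAdapterSketch(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        hidden = max(channels // reduction, 1)
        # Channel branch: squeeze-and-excitation style gate.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, hidden, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, 1),
            nn.Sigmoid(),
        )
        # Spatial branch: attention map computed from both modalities.
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(2 * channels, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )
        # Frequency branch: 1x1 conv gating the Fourier magnitude.
        self.freq_conv = nn.Conv2d(channels, channels, 1)

    def forward(self, rgb: torch.Tensor, aux: torch.Tensor) -> torch.Tensor:
        # Frequency modeling: re-weight auxiliary features per frequency bin.
        spec = torch.fft.rfft2(aux, norm="ortho")         # complex spectrum
        gate = torch.sigmoid(self.freq_conv(spec.abs()))  # real-valued gate
        aux_f = torch.fft.irfft2(spec * gate, s=aux.shape[-2:], norm="ortho")
        # Channel modeling: emphasize informative channels.
        aux_c = aux_f * self.channel_gate(aux_f)
        # Spatial modeling: locate where the auxiliary cue should be injected.
        attn = self.spatial_gate(torch.cat([rgb, aux_c], dim=1))
        # Residual prompt added to the dominant (RGB) features.
        return rgb + aux_c * attn
```

Usage would follow the usual adapter pattern, e.g. `VisualAdapterSketch(256)(rgb_feat, tir_feat)` on two (B, 256, H, W) feature maps, leaving the frozen backbone untouched.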

📝 Abstract
Prompt-learning-based multi-modal trackers have achieved promising progress by employing lightweight visual adapters to incorporate auxiliary modality features into frozen foundation models. However, existing approaches often struggle to learn reliable prompts due to limited exploitation of critical cues across frequency and temporal domains. In this paper, we propose a novel visual and memory dual adapter (VMDA) to construct more robust and discriminative representations for multi-modal tracking. Specifically, we develop a simple but effective visual adapter that adaptively transfers discriminative cues from the auxiliary modality to the dominant modality by jointly modeling the frequency, spatial, and channel-wise features. Additionally, we design the memory adapter inspired by the human memory mechanism, which stores global temporal cues and performs dynamic update and retrieval operations to ensure the consistent propagation of reliable temporal information across video sequences. Extensive experiments demonstrate that our method achieves state-of-the-art performance on various multi-modal tracking tasks, including RGB-Thermal, RGB-Depth, and RGB-Event tracking. Code and models are available at https://github.com/xuboyue1999/mmtrack.git.
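
The memory adapter described above (store global temporal cues, update and retrieve them dynamically) can be illustrated with a short sketch. The slot-based memory bank, the confidence-gated momentum write, and all names here are assumptions made for illustration, not the authors' released design:

```python
# Hedged sketch: a fixed-size bank of temporal tokens read via cross-attention
# and written with a confidence-gated momentum update. Not the authors' code.
import torch
import torch.nn as nn

class MemoryAdapterSketch(nn.Module):
    def __init__(self, dim: int, slots: int = 8, momentum: float = 0.9):
        super().__init__()
        self.momentum = momentum
        # Memory bank refreshed online during tracking (a buffer, not a
        # learned parameter).
        self.register_buffer("memory", torch.zeros(slots, dim))
        # Assumes dim is divisible by num_heads.
        self.read = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    @torch.no_grad()
    def write(self, frame_tokens: torch.Tensor, confidence: float) -> None:
        # Dynamic update: blend the pooled frame cue into the memory, scaled
        # by tracking confidence so unreliable frames barely write.
        summary = frame_tokens.mean(dim=1)  # (B, N, dim) -> (B, dim)
        alpha = self.momentum + (1.0 - self.momentum) * (1.0 - confidence)
        self.memory.mul_(alpha).add_((1.0 - alpha) * summary.mean(dim=0))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # Retrieval: current-frame tokens attend to the stored temporal cues.
        mem = self.memory.unsqueeze(0).expand(tokens.size(0), -1, -1)
        out, _ = self.read(tokens, mem, mem, need_weights=False)
        return tokens + out  # residual injection of temporal context
```

During tracking, one would call `forward` on each frame's tokens before the prediction head and `write` afterwards with the head's confidence score, so low-confidence frames (e.g. under occlusion) leave the stored temporal cues largely intact.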
Problem

Research questions and friction points this paper is trying to address.

Prompt-learning trackers underexploit critical cues in the frequency and temporal domains, yielding unreliable prompts
Lightweight visual adapters feed auxiliary-modality features into frozen foundation models, but the fused representations lack discriminative power
Temporal information is not propagated consistently across video sequences, degrading robustness under occlusion, illumination variation, and motion blur
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual adapter for adaptive cross-modal feature transfer
Memory adapter for consistent temporal cue propagation
Joint frequency, spatial, and channel-wise modeling
Boyue Xu
Nanjing University
Ruichao Hou
Nanjing University
Information Fusion, Multimedia Computing
Tongwei Ren
Nanjing University
Multimedia Computing
Gangshan Wu
State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210008, China