Visual and Memory Dual Adapter for Multi-Modal Object Tracking

📅 2025-06-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing prompt-based multi-modal trackers suffer from inadequate modeling of critical cues in the frequency and temporal domains, resulting in non-robust prompt generation. To address this, the paper proposes the Visual and Memory Dual Adapter (VMDA). The visual adapter jointly models frequency-domain, spatial, and channel-wise features to adaptively fuse auxiliary-modality cues into the dominant modality; the memory adapter, inspired by the human memory mechanism, performs dynamic online update and retrieval of global temporal information. Together, these components strengthen cross-modal discriminative representation learning and preserve temporal consistency. Evaluated on RGB-Thermal, RGB-Depth, and RGB-Event tracking benchmarks, VMDA achieves state-of-the-art performance, with notable gains in accuracy and robustness under occlusion, illumination variation, and motion blur.
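
Below is a minimal sketch of how a visual adapter of this kind could combine frequency, spatial, and channel-wise cues to inject auxiliary-modality features into the RGB branch. All module names, layer choices, and the FFT-gating scheme are illustrative assumptions, not the paper's released implementation (see the linked repository for that):

```python
# Hedged sketch: one plausible visual-adapter design that jointly models
# frequency, spatial, and channel features. Not the authors' code.
import torch
import torch.nn as nn

class VisualAdapterSketch(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        hidden = max(channels // reduction, 1)
        # Channel branch: squeeze-and-excitation style gate.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, hidden, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, 1),
            nn.Sigmoid(),
        )
        # Spatial branch: attention map computed from both modalities.
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(2 * channels, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )
        # Frequency branch: 1x1 conv gating the Fourier magnitude.
        self.freq_conv = nn.Conv2d(channels, channels, 1)

    def forward(self, rgb: torch.Tensor, aux: torch.Tensor) -> torch.Tensor:
        # Frequency modeling: re-weight auxiliary features per frequency bin.
        spec = torch.fft.rfft2(aux, norm="ortho")         # complex spectrum
        gate = torch.sigmoid(self.freq_conv(spec.abs()))  # real-valued gate
        aux_f = torch.fft.irfft2(spec * gate, s=aux.shape[-2:], norm="ortho")
        # Channel modeling: emphasize informative channels.
        aux_c = aux_f * self.channel_gate(aux_f)
        # Spatial modeling: locate where the auxiliary cue should be injected.
        attn = self.spatial_gate(torch.cat([rgb, aux_c], dim=1))
        # Residual prompt added to the dominant (RGB) features.
        return rgb + aux_c * attn
```

Usage would follow the usual adapter pattern, e.g. `VisualAdapterSketch(256)(rgb_feat, tir_feat)` on two (B, 256, H, W) feature maps, leaving the frozen backbone untouched.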

📝 Abstract
Prompt-learning-based multi-modal trackers have achieved promising progress by employing lightweight visual adapters to incorporate auxiliary modality features into frozen foundation models. However, existing approaches often struggle to learn reliable prompts due to limited exploitation of critical cues across frequency and temporal domains. In this paper, we propose a novel visual and memory dual adapter (VMDA) to construct more robust and discriminative representations for multi-modal tracking. Specifically, we develop a simple but effective visual adapter that adaptively transfers discriminative cues from the auxiliary modality to the dominant modality by jointly modeling the frequency, spatial, and channel-wise features. Additionally, we design the memory adapter inspired by the human memory mechanism, which stores global temporal cues and performs dynamic update and retrieval operations to ensure the consistent propagation of reliable temporal information across video sequences. Extensive experiments demonstrate that our method achieves state-of-the-art performance on various multi-modal tracking tasks, including RGB-Thermal, RGB-Depth, and RGB-Event tracking. Code and models are available at https://github.com/xuboyue1999/mmtrack.git.
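
The memory adapter described above (store global temporal cues, update and retrieve them dynamically) can be illustrated with a short sketch. The slot-based memory bank, the confidence-gated momentum write, and all names here are assumptions made for illustration, not the authors' released design:

```python
# Hedged sketch: a fixed-size bank of temporal tokens read via cross-attention
# and written with a confidence-gated momentum update. Not the authors' code.
import torch
import torch.nn as nn

class MemoryAdapterSketch(nn.Module):
    def __init__(self, dim: int, slots: int = 8, momentum: float = 0.9):
        super().__init__()
        self.momentum = momentum
        # Memory bank refreshed online during tracking (a buffer, not a
        # learned parameter).
        self.register_buffer("memory", torch.zeros(slots, dim))
        # Assumes dim is divisible by num_heads.
        self.read = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    @torch.no_grad()
    def write(self, frame_tokens: torch.Tensor, confidence: float) -> None:
        # Dynamic update: blend the pooled frame cue into the memory, scaled
        # by tracking confidence so unreliable frames barely write.
        summary = frame_tokens.mean(dim=1)  # (B, N, dim) -> (B, dim)
        alpha = self.momentum + (1.0 - self.momentum) * (1.0 - confidence)
        self.memory.mul_(alpha).add_((1.0 - alpha) * summary.mean(dim=0))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # Retrieval: current-frame tokens attend to the stored temporal cues.
        mem = self.memory.unsqueeze(0).expand(tokens.size(0), -1, -1)
        out, _ = self.read(tokens, mem, mem, need_weights=False)
        return tokens + out  # residual injection of temporal context
```

During tracking, one would call `forward` on each frame's tokens before the prediction head and `write` afterwards with the head's confidence score, so low-confidence frames (e.g. under occlusion) leave the stored temporal cues largely intact.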
Problem

Research questions and friction points this paper is trying to address.

Prompt-learning trackers underexploit critical cues in the frequency and temporal domains, yielding unreliable prompts
Lightweight visual adapters feed auxiliary-modality features into frozen foundation models, but the fused representations lack discriminative power
Temporal information is not propagated consistently across video sequences, degrading robustness under occlusion, illumination variation, and motion blur
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual adapter for adaptive cross-modal feature transfer
Memory adapter for consistent temporal cue propagation
Joint frequency, spatial, and channel-wise modeling
Boyue Xu
Nanjing University
Ruichao Hou
Nanjing University
Information Fusion, Multimedia Computing
Tongwei Ren
Nanjing University
Multimedia Computing
Gangshan Wu
State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210008, China