SEATrack: Simple, Efficient, and Adaptive Multimodal Tracker

2026-04-14

AI Summary
This work addresses the parameter inefficiency of existing multimodal tracking methods, which often rely on excessive model parameters and violate the principle of parameter-efficient fine-tuning. To overcome this limitation, the authors propose a concise and efficient dual-stream framework that achieves dynamic cross-modal attention alignment through Adaptive Mutual-Guided LoRA (AMG-LoRA) and incorporates a Hierarchical Mixture-of-Experts (HMoE) mechanism for effective global relational modeling. The proposed approach significantly enhances tracking performance across RGB-T, RGB-D, and RGB-E modalities while maintaining minimal parameter overhead, thereby achieving a superior trade-off between computational efficiency and accuracy.
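To make the "minimal parameter overhead" claim concrete, here is a minimal sketch of plain LoRA, assuming a single frozen linear layer. It is not the authors' AMG-LoRA: the mutual-guidance step is omitted, and the helper names (`make_lora`, `lora_forward`, `lora_param_counts`) and the 768-dimensional layer with rank 8 are illustrative assumptions, not values from the paper.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def make_lora(d_in, d_out, rank, seed=0):
    """Create the two low-rank factors; B starts at zero so the update is initially a no-op."""
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, 0.01) for _ in range(rank)] for _ in range(d_in)]  # d_in x rank
    B = [[0.0] * d_out for _ in range(rank)]                                # rank x d_out
    return A, B

def lora_forward(x, W, A, B, scale=1.0):
    """Compute x @ W + scale * (x @ A) @ B, where W is frozen and only A, B are trained."""
    base = matmul(x, W)
    delta = matmul(matmul(x, A), B)
    return [[b + scale * d for b, d in zip(b_row, d_row)]
            for b_row, d_row in zip(base, delta)]

def lora_param_counts(d_in, d_out, rank):
    """Return (parameters in the full weight, parameters in the LoRA factors)."""
    return d_in * d_out, rank * (d_in + d_out)

# Hypothetical layer size and rank, chosen only for illustration.
full, lora = lora_param_counts(768, 768, 8)
print(full, lora, lora / full)

# Tiny forward pass: with B at zero, the adapted layer matches the frozen layer.
W = [[1.0, 0.0], [0.0, 1.0]]
A, B = make_lora(2, 2, rank=1)
out = lora_forward([[3.0, 4.0]], W, A, B)
```

Under these assumptions the low-rank factors hold about 2% of the parameters of the full weight matrix, which is the kind of budget a PEFT method is expected to stay within.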

Abstract
Parameter-efficient fine-tuning (PEFT) in multimodal tracking reveals a concerning trend: recent performance gains are often achieved at the cost of inflated parameter budgets, which fundamentally erodes PEFT's efficiency promise. In this work, we introduce SEATrack, a Simple, Efficient, and Adaptive two-stream multimodal tracker that tackles this performance-efficiency dilemma from two complementary perspectives. We first prioritize cross-modal alignment of matching responses, an underexplored yet pivotal factor that we argue is essential for breaking the trade-off. Specifically, we observe that modality-specific biases in existing two-stream methods generate conflicting matching attention maps, thereby hindering effective joint representation learning. To mitigate this, we propose AMG-LoRA, which integrates Low-Rank Adaptation (LoRA) for domain adaptation with Adaptive Mutual Guidance (AMG) to dynamically refine and align attention maps across modalities. We then depart from conventional local fusion approaches by introducing a Hierarchical Mixture of Experts (HMoE) that enables efficient global relation modeling, balancing expressiveness and computational efficiency in cross-modal fusion. Equipped with these innovations, SEATrack achieves notable progress over state-of-the-art methods in balancing performance with efficiency across RGB-T, RGB-D, and RGB-E tracking tasks. Code is available at https://github.com/AutoLab-SAI-SJTU/SEATrack.
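As a rough illustration of the mixture-of-experts idea behind HMoE, here is a minimal sketch of generic top-k expert gating. It is not the paper's hierarchical design, whose details are not given in the abstract. The function names, the toy experts, and the gate scores passed in by hand are all assumptions; in a trained model a learned router would produce the scores.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_gate(scores, k):
    """Keep the k highest-scoring experts and renormalize their weights."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    weights = softmax([scores[i] for i in top])
    return dict(zip(top, weights))

def moe_forward(x, experts, gate_scores, k=2):
    """Combine the outputs of the selected experts, weighted by their gate values."""
    gates = top_k_gate(gate_scores, k)
    out = [0.0] * len(x)
    for idx, weight in gates.items():
        y = experts[idx](x)  # only the selected experts are evaluated
        out = [o + weight * v for o, v in zip(out, y)]
    return out

# Toy experts standing in for small fusion networks (hypothetical).
experts = [
    lambda v: [2.0 * t for t in v],
    lambda v: [t + 1.0 for t in v],
    lambda v: [-t for t in v],
]

# Hand-written gate scores; experts 0 and 1 tie and expert 2 is excluded.
y = moe_forward([1.0, 2.0], experts, gate_scores=[0.0, 0.0, -5.0], k=2)
print(y)
```

Because only `k` experts run per input, compute grows with `k` rather than with the total number of experts, which is one common way such a design trades expressiveness against cost.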
Problem

Research questions and friction points this paper is trying to address.

multimodal tracking
parameter-efficient fine-tuning
performance-efficiency trade-off
cross-modal alignment
two-stream architecture
Innovation

Methods, ideas, or system contributions that make the work stand out.

AMG-LoRA
Hierarchical Mixture of Experts
Cross-modal Alignment
Parameter-efficient Fine-tuning
Multimodal Tracking