Sparse-Dense Mixture of Experts Adapter for Multi-Modal Tracking

📅 2026-03-13
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing parameter-efficient fine-tuning methods struggle to jointly model heterogeneous multimodal features, limiting tracking performance. This work proposes a Sparse-Dense Mixture of Experts Adapter (SDMoEA) framework in which a sparse MoE captures modality-specific information and a dense-shared MoE models shared cross-modal information. A Gram-based Semantic Alignment Hypergraph Fusion (GSAHF) module is further introduced to capture high-order cross-modal dependencies. The authors describe this as, to the best of their knowledge, the first approach to integrate a mixture-of-experts architecture with hypergraph neural networks for multimodal tracking. The method is reported to outperform existing parameter-efficient approaches on seven benchmarks: LasHeR, RGBT234, VTUAV, VisEvent, COESOT, DepthTrack, and VOT-RGBD2022.

📝 Abstract
Parameter-efficient fine-tuning (PEFT) techniques, such as prompts and adapters, are widely used in multi-modal tracking because they alleviate issues of full-model fine-tuning, including time inefficiency, high resource consumption, parameter storage burden, and catastrophic forgetting. However, due to cross-modal heterogeneity, most existing PEFT-based methods struggle to effectively represent multi-modal features within a unified framework with shared parameters. To address this problem, we propose a novel Sparse-Dense Mixture of Experts Adapter (SDMoEA) framework for PEFT-based multi-modal tracking under a unified model structure. Specifically, we design an SDMoE module as the multi-modal adapter to model modality-specific and shared information efficiently. SDMoE consists of a sparse MoE and a dense-shared MoE: the former captures modality-specific information, while the latter models shared cross-modal information. Furthermore, to overcome limitations of existing tracking methods in modeling high-order correlations during multi-level multi-modal fusion, we introduce a Gram-based Semantic Alignment Hypergraph Fusion (GSAHF) module. It first employs Gram matrices for cross-modal semantic alignment, ensuring that the constructed hypergraph accurately reflects semantic similarity and high-order dependencies between modalities. The aligned features are then integrated into the hypergraph structure to exploit its ability to model high-order relationships, enabling deep fusion of multi-level multi-modal information. Extensive experiments demonstrate that the proposed method achieves superior performance compared with other PEFT approaches on several multi-modal tracking benchmarks, including LasHeR, RGBT234, VTUAV, VisEvent, COESOT, DepthTrack, and VOT-RGBD2022.
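
The abstract does not give layer definitions, so the following is only a minimal sketch of the sparse-plus-dense idea, not the authors' SDMoE module. It assumes a top-k softmax router over the sparse experts, plain linear maps as experts, an averaged set of always-active shared experts, and a residual connection. The class name `SparseDenseMoE` and all of its parameters are hypothetical.

```python
import math
import random


def linear(weights, x):
    """Apply a weight matrix (a list of rows) to a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Small random weights, standing in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class SparseDenseMoE:
    """Hypothetical adapter: top-k routed experts plus always-on shared experts."""

    def __init__(self, dim, num_sparse=4, num_dense=2, top_k=1, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        self.router = random_matrix(num_sparse, dim, rng)
        self.sparse_experts = [random_matrix(dim, dim, rng) for _ in range(num_sparse)]
        self.dense_experts = [random_matrix(dim, dim, rng) for _ in range(num_dense)]

    def route(self, x):
        """Return (expert index, renormalized gate) pairs for the top-k experts."""
        probs = softmax(linear(self.router, x))
        ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        chosen = ranked[: self.top_k]
        norm = sum(probs[i] for i in chosen)
        return [(i, probs[i] / norm) for i in chosen]

    def forward(self, x):
        out = list(x)  # residual path (an assumption, common for adapters)
        # Sparse branch: only the routed experts run for this token.
        for i, gate in self.route(x):
            y = linear(self.sparse_experts[i], x)
            out = [o + gate * v for o, v in zip(out, y)]
        # Dense branch: every shared expert runs; outputs are averaged.
        for expert in self.dense_experts:
            y = linear(expert, x)
            out = [o + v / len(self.dense_experts) for o, v in zip(out, y)]
        return out


moe = SparseDenseMoE(dim=4, num_sparse=4, num_dense=2, top_k=1)
token = [0.5, -1.0, 0.25, 2.0]
routed = moe.route(token)
output = moe.forward(token)
```

In this sketch the sparse branch lets different tokens (for example, from different modalities) activate different experts, while the dense branch applies the same shared transformation to every token.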
Problem

Research questions and friction points this paper is trying to address.

multi-modal tracking
parameter-efficient fine-tuning
cross-modal heterogeneity
high-order correlation
feature fusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse-Dense Mixture of Experts
Parameter-Efficient Fine-Tuning
Multi-Modal Tracking
Hypergraph Fusion
Gram Matrix Alignment
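
As a rough illustration of Gram-matrix alignment followed by hypergraph fusion, the sketch below makes several assumptions that the abstract does not confirm: the Gram matrix is taken over L2-normalized token features (cosine similarities), alignment is measured as the mean squared difference between the two modalities' Gram matrices, hyperedges are built from each node's k most similar neighbours, and fusion is a single parameter-free propagation step. All function names are hypothetical and this is not the GSAHF module itself.

```python
import math


def gram(features):
    """Gram matrix of L2-normalized rows: entry (i, j) is a cosine similarity."""
    normed = []
    for row in features:
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        normed.append([v / norm for v in row])
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in normed] for r1 in normed]


def gram_alignment_loss(feats_a, feats_b):
    """Mean squared difference between the two modalities' Gram matrices."""
    ga, gb = gram(feats_a), gram(feats_b)
    n = len(ga)
    return sum((ga[i][j] - gb[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)


def knn_hyperedges(features, k):
    """One hyperedge per node: the node itself plus its k most similar nodes."""
    sim = gram(features)
    edges = []
    for i, row in enumerate(sim):
        others = [j for j in range(len(row)) if j != i]
        neighbours = sorted(others, key=lambda j: row[j], reverse=True)[:k]
        edges.append(sorted([i] + neighbours))
    return edges


def hypergraph_smooth(features, edges):
    """One propagation step: nodes -> hyperedge means -> back to node means."""
    dim = len(features[0])
    edge_means = [
        [sum(features[n][d] for n in edge) / len(edge) for d in range(dim)]
        for edge in edges
    ]
    out = []
    for node in range(len(features)):
        # Every node belongs to at least its own hyperedge, so this is non-empty.
        member_of = [m for edge, m in zip(edges, edge_means) if node in edge]
        out.append([sum(m[d] for m in member_of) / len(member_of) for d in range(dim)])
    return out


# Toy tokens: the second modality is a rescaled copy of the first, so the
# cosine-based Gram matrices agree and the alignment loss is close to zero.
rgb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
tir = [[2.0, 0.0], [1.8, 0.2], [0.0, 3.0]]
loss = gram_alignment_loss(rgb, tir)

nodes = rgb + tir  # treat tokens from both modalities as hypergraph nodes
edges = knn_hyperedges(nodes, k=2)
fused = hypergraph_smooth(nodes, edges)
```

Because each hyperedge can connect more than two nodes, one propagation step mixes information across a whole group of similar tokens at once, which is the sense in which a hypergraph models higher-order relationships than a pairwise graph.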