Sparse-Dense Mixture of Experts Adapter for Multi-Modal Tracking

📅 2026-03-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing parameter-efficient fine-tuning methods struggle to jointly model heterogeneous multimodal features, limiting tracking performance. This work proposes a Sparse-Dense Mixture-of-Experts Adapter (SDMoEA) framework, where sparse experts capture modality-specific information and dense shared experts model cross-modal commonalities. Furthermore, a Gram matrix–based semantic alignment hypergraph fusion module is introduced to capture high-order cross-modal dependencies. To the best of our knowledge, this is the first approach to integrate a mixture-of-experts architecture with hypergraph neural networks for multimodal tracking. The proposed method achieves state-of-the-art performance across seven benchmarks—including LasHeR, RGBT234, and VTUAV—significantly outperforming existing parameter-efficient approaches.
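The sparse-dense split described above can be illustrated with a toy sketch. The following is not the authors' implementation: dimensions, expert counts, the ReLU bottleneck, and top-k routing are illustrative assumptions, written in plain numpy so the routing logic is visible.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class SDMoEAdapter:
    """Toy sparse-dense MoE adapter: sparse experts are selected per token
    by top-k gating (modality-specific), dense shared experts always fire
    (cross-modal), and both outputs are added back residually."""

    def __init__(self, dim, n_sparse=4, n_shared=2, k=1, bottleneck=8):
        def expert():
            # low-rank down/up projection, typical of adapter layers
            return (rng.standard_normal((dim, bottleneck)) * 0.1,
                    rng.standard_normal((bottleneck, dim)) * 0.1)
        self.sparse = [expert() for _ in range(n_sparse)]
        self.shared = [expert() for _ in range(n_shared)]
        self.gate = rng.standard_normal((dim, n_sparse)) * 0.1
        self.k = k

    def __call__(self, x):                       # x: (tokens, dim)
        scores = softmax(x @ self.gate)          # routing probabilities
        topk = np.argsort(scores, axis=-1)[:, -self.k:]
        out = np.zeros_like(x)
        for t in range(x.shape[0]):              # sparse: only top-k experts run
            for e in topk[t]:
                down, up = self.sparse[e]
                out[t] += scores[t, e] * (np.maximum(x[t] @ down, 0) @ up)
        for down, up in self.shared:             # dense: every shared expert runs
            out += np.maximum(x @ down, 0) @ up
        return x + out                           # residual connection

adapter = SDMoEAdapter(dim=16)
y = adapter(rng.standard_normal((5, 16)))        # (5, 16) tokens in, same shape out
```

In a real tracker the gate would learn to route RGB and thermal tokens to different sparse experts, while the shared experts see every token regardless of modality.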

📝 Abstract
Parameter-efficient fine-tuning (PEFT) techniques, such as prompts and adapters, are widely used in multi-modal tracking because they alleviate issues of full-model fine-tuning, including time inefficiency, high resource consumption, parameter storage burden, and catastrophic forgetting. However, due to cross-modal heterogeneity, most existing PEFT-based methods struggle to effectively represent multi-modal features within a unified framework with shared parameters. To address this problem, we propose a novel Sparse-Dense Mixture of Experts Adapter (SDMoEA) framework for PEFT-based multi-modal tracking under a unified model structure. Specifically, we design an SDMoE module as the multi-modal adapter to model modality-specific and shared information efficiently. SDMoE consists of a sparse MoE and a dense-shared MoE: the former captures modality-specific information, while the latter models shared cross-modal information. Furthermore, to overcome limitations of existing tracking methods in modeling high-order correlations during multi-level multi-modal fusion, we introduce a Gram-based Semantic Alignment Hypergraph Fusion (GSAHF) module. It first employs Gram matrices for cross-modal semantic alignment, ensuring that the constructed hypergraph accurately reflects semantic similarity and high-order dependencies between modalities. The aligned features are then integrated into the hypergraph structure to exploit its ability to model high-order relationships, enabling deep fusion of multi-level multi-modal information. Extensive experiments demonstrate that the proposed method achieves superior performance compared with other PEFT approaches on several multi-modal tracking benchmarks, including LasHeR, RGBT234, VTUAV, VisEvent, COESOT, DepthTrack, and VOT-RGBD2022.
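The Gram-based alignment and hypergraph fusion step can also be sketched. This is a minimal numpy illustration under assumed design choices (cosine-normalized Gram matrices, one hyperedge per thermal token, simple degree-normalized propagation), not the GSAHF module itself.

```python
import numpy as np

def cross_gram(a, b):
    # Gram-style matrix of cosine similarities between the tokens of two
    # modalities; used here as the cross-modal semantic alignment signal
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-8)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-8)
    return a @ b.T

def build_incidence(g, thresh=0.3):
    # each column (one thermal token) becomes a hyperedge connecting all
    # RGB tokens whose aligned similarity exceeds the threshold, so a single
    # hyperedge can bind many tokens at once (a high-order relation)
    return (g > thresh).astype(float)

def hypergraph_conv(x, H):
    # degree-normalized hypergraph propagation: X' = Dv^-1 H De^-1 H^T X
    De = np.maximum(H.sum(axis=0), 1e-8)   # hyperedge degrees
    Dv = np.maximum(H.sum(axis=1), 1e-8)   # node degrees
    return ((H / De) @ (H.T @ x)) / Dv[:, None]

rng = np.random.default_rng(0)
rgb, tir = rng.standard_normal((6, 8)), rng.standard_normal((6, 8))
fused = hypergraph_conv(rgb, build_incidence(cross_gram(rgb, tir)))
```

Unlike a pairwise attention map, the incidence matrix lets one edge aggregate an arbitrary group of tokens, which is the high-order dependency the abstract refers to.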
Problem

Research questions and friction points this paper is trying to address.

multi-modal tracking
parameter-efficient fine-tuning
cross-modal heterogeneity
high-order correlation
feature fusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse-Dense Mixture of Experts
Parameter-Efficient Fine-Tuning
Multi-Modal Tracking
Hypergraph Fusion
Gram Matrix Alignment
Yabin Zhu
School of Public Security and Emergency Management, Anhui University of Science and Technology, Hefei 231131, China
Jianqi Li
School of Public Security and Emergency Management, Anhui University of Science and Technology, Hefei 231131, China
Chenglong Li
Professor, The University of Florida
Drug Design, Drug Discovery, Molecular Recognition, Molecular Modeling, Protein Structure and Dynamics
Jiaxiang Wang
King's College London
Semantic communications, generative AI, machine learning, wireless communication, information theory
Chengjie Gu
School of Public Security and Emergency Management, Anhui University of Science and Technology, Hefei 231131, China
Jin Tang
Anhui University
Computer vision, intelligent video analysis