Unified Multimodal Visual Tracking with Dual Mixture-of-Experts

📅 2026-05-05

🤖 AI Summary
This work addresses the inefficiency and poor scalability of existing multimodal visual object tracking methods, which typically train a separate model per modality or adapt a pretrained model to each new modality. The authors propose OneTrackerV2, a unified tracking framework that is trained end-to-end for any input modality. Its core components are a Meta Merger, which embeds multimodal information into a unified space for flexible fusion, and a Dual Mixture-of-Experts (DMoE) mechanism in which T-MoE models spatio-temporal relations and M-MoE embeds multimodal knowledge. With a shared architecture and a single set of parameters, OneTrackerV2 reports state-of-the-art performance across five RGB and RGB+X tracking tasks on 12 benchmarks while maintaining high inference efficiency. The model retains strong performance after compression and remains robust when a modality is missing.
📝 Abstract
Multimodal visual object tracking can be divided into several kinds of tasks (e.g., RGB and RGB+X tracking), based on the input modality. Existing methods often train separate models for each modality or rely on pretrained models to adapt to new modalities, which limits efficiency, scalability, and usability. Thus, we introduce OneTrackerV2, a unified multi-modal tracking framework that enables end-to-end training for any modality. We propose Meta Merger to embed multi-modal information into a unified space, allowing flexible modality fusion and robustness. We further introduce Dual Mixture-of-Experts (DMoE): T-MoE models spatio-temporal relations for tracking, while M-MoE embeds multi-modal knowledge, disentangling cross-modal dependencies and reducing feature conflicts. With a shared architecture, unified parameters, and a single end-to-end training run, OneTrackerV2 achieves state-of-the-art performance across five RGB and RGB+X tracking tasks and 12 benchmarks, while maintaining high inference efficiency. Notably, even after model compression, OneTrackerV2 retains strong performance. Moreover, OneTrackerV2 demonstrates remarkable robustness under modality-missing scenarios.
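The abstract gives no implementation details for Meta Merger or DMoE, so the following is only a generic sketch of the two ideas it names: fusing whichever modality embeddings are present into one vector, and routing that vector through a top-k mixture-of-experts layer. Every name here (`merge_modalities`, `MoELayer`, `route`, `forward`) is hypothetical, the averaging step is a simple stand-in and not the paper's Meta Merger, and the router is a standard softmax top-k gate rather than the authors' T-MoE or M-MoE.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def merge_modalities(embeddings):
    """Average the modality embeddings that are present.

    Assumption: a missing modality is passed as None. This is a stand-in
    for a learned merger, not the paper's Meta Merger.
    """
    present = [e for e in embeddings if e is not None]
    if not present:
        raise ValueError("at least one modality is required")
    return [sum(vals) / len(present) for vals in zip(*present)]


class MoELayer:
    """Generic top-k mixture-of-experts layer (hypothetical, not the paper's DMoE)."""

    def __init__(self, dim, num_experts, top_k, seed=0):
        rng = random.Random(seed)

        def rand_matrix(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.top_k = top_k
        self.router = rand_matrix(num_experts, dim)  # one scoring row per expert
        self.experts = [rand_matrix(dim, dim) for _ in range(num_experts)]

    def route(self, token):
        """Return (expert_index, weight) pairs for the top-k experts."""
        probs = softmax(matvec(self.router, token))
        ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        chosen = ranked[: self.top_k]
        norm = sum(probs[i] for i in chosen)  # renormalize over the chosen experts
        return [(i, probs[i] / norm) for i in chosen]

    def forward(self, token):
        """Weighted sum of the selected experts' linear outputs."""
        out = [0.0] * len(token)
        for idx, weight in self.route(token):
            expert_out = matvec(self.experts[idx], token)
            out = [o + weight * e for o, e in zip(out, expert_out)]
        return out


rgb = [0.5, -0.2, 0.1, 0.3]
depth = None  # simulate a missing auxiliary modality
fused = merge_modalities([rgb, depth])
layer = MoELayer(dim=4, num_experts=3, top_k=2)
output = layer.forward(fused)
```

In this sketch a missing modality simply drops out of the average, so the same layer runs on RGB-only and RGB+X inputs without a separate model; how OneTrackerV2 actually achieves this is not specified in the abstract.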
Problem

Research questions and friction points this paper is trying to address.

multimodal visual tracking
modality fusion
model scalability
tracking efficiency
cross-modal dependencies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified Multimodal Tracking
Dual Mixture-of-Experts
Meta Merger
End-to-End Training
Modality Robustness