Unified Multimodal Visual Tracking with Dual Mixture-of-Experts

📅 2026-05-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the inefficiency and poor scalability of existing multimodal visual object tracking methods, which typically require separate modeling due to modality heterogeneity. To overcome these limitations, the authors propose OneTrackerV2, the first unified tracking framework capable of handling arbitrary visual modalities as input. Its core innovations are a Meta Merger, which fuses modalities under a unified representation, and a dual-path mixture-of-experts mechanism in which T-MoE models spatiotemporal dynamics and M-MoE models cross-modal knowledge. Trained end-to-end, OneTrackerV2 achieves state-of-the-art performance across five RGB and RGB+X tracking tasks on 12 benchmarks while maintaining high inference efficiency. Notably, the model remains competitive after compression and stays robust when modalities are missing.

📝 Abstract

Multimodal visual object tracking can be divided into several kinds of tasks (e.g., RGB and RGB+X tracking) based on the input modality. Existing methods often train separate models for each modality or rely on pretrained models to adapt to new modalities, which limits efficiency, scalability, and usability. Thus, we introduce OneTrackerV2, a unified multimodal tracking framework that enables end-to-end training for any modality. We propose Meta Merger to embed multimodal information into a unified space, allowing flexible modality fusion and robustness. We further introduce Dual Mixture-of-Experts (DMoE): T-MoE models spatio-temporal relations for tracking, while M-MoE embeds multimodal knowledge, disentangling cross-modal dependencies and reducing feature conflicts. With a shared architecture, unified parameters, and a single end-to-end training run, OneTrackerV2 achieves state-of-the-art performance across five RGB and RGB+X tracking tasks and 12 benchmarks, while maintaining high inference efficiency. Notably, even after model compression, OneTrackerV2 retains strong performance. Moreover, OneTrackerV2 demonstrates remarkable robustness under modality-missing scenarios.
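
The abstract does not say how T-MoE or M-MoE route features internally, so the following is only a minimal sketch of a generic top-k mixture-of-experts layer, the general technique the DMoE name refers to. It is not the authors' implementation. The class name ToyMoE, the single-linear-map experts, the dimensions, and the top-k routing rule are all assumptions chosen for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class ToyMoE:
    """Generic top-k mixture-of-experts layer (hypothetical, illustrative only)."""

    def __init__(self, dim, num_experts, top_k, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        # Router: one scoring row per expert.
        self.router = [
            [rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(num_experts)
        ]
        # Assumption: each expert is a single dim x dim linear map.
        self.experts = [
            [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(dim)]
            for _ in range(num_experts)
        ]

    def route(self, token):
        """Return (expert_index, weight) pairs for the top-k scoring experts."""
        scores = matvec(self.router, token)
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        top = ranked[: self.top_k]
        # Renormalize the gate over the selected experts only.
        weights = softmax([scores[i] for i in top])
        return list(zip(top, weights))

    def forward(self, token):
        """Weighted sum of the selected experts' outputs."""
        out = [0.0] * len(token)
        for idx, weight in self.route(token):
            expert_out = matvec(self.experts[idx], token)
            out = [o + weight * e for o, e in zip(out, expert_out)]
        return out


layer = ToyMoE(dim=4, num_experts=3, top_k=2)
token = [0.5, -1.0, 0.25, 2.0]
routing = layer.route(token)
output = layer.forward(token)
```

In a dual arrangement like the one described, one such layer could plausibly be gated on spatio-temporal features and another on modality features, but that mapping is a guess rather than something the abstract states.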

Problem

Research questions and friction points this paper is trying to address.

multimodal visual tracking

modality fusion

model scalability

tracking efficiency

cross-modal dependencies

Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified Multimodal Tracking

Dual Mixture-of-Experts

Meta Merger