🤖 AI Summary
This work addresses the inefficiency and poor scalability of existing multimodal visual object tracking methods, which typically require separate modeling due to modality heterogeneity. To overcome these limitations, the authors propose OneTrackerV2, the first unified tracking framework capable of handling arbitrary visual modalities as input. Its core innovations include a Meta Merger for multimodal fusion under a unified representation and a dual-path mixture-of-experts mechanism, comprising T-MoE and M-MoE, which model spatiotemporal dynamics and cross-modal knowledge, respectively. Trained end-to-end, OneTrackerV2 achieves state-of-the-art performance across five RGB and RGB+X tracking tasks on twelve benchmarks, while maintaining high inference efficiency. Notably, the model remains competitive even after compression and demonstrates exceptional robustness and generalization under missing-modality scenarios.
📝 Abstract
Multimodal visual object tracking can be divided into several kinds of tasks (e.g., RGB and RGB+X tracking), based on the input modality. Existing methods often train separate models for each modality or rely on pretrained models to adapt to new modalities, which limits efficiency, scalability, and usability. Thus, we introduce OneTrackerV2, a unified multi-modal tracking framework that enables end-to-end training for any modality. We propose Meta Merger to embed multi-modal information into a unified space, allowing flexible modality fusion and robustness. We further introduce Dual Mixture-of-Experts (DMoE): T-MoE models spatio-temporal relations for tracking, while M-MoE embeds multi-modal knowledge, disentangling cross-modal dependencies and reducing feature conflicts. With a shared architecture, unified parameters, and a single end-to-end training run, OneTrackerV2 achieves state-of-the-art performance across five RGB and RGB+X tracking tasks and 12 benchmarks, while maintaining high inference efficiency. Notably, even after model compression, OneTrackerV2 retains strong performance. Moreover, OneTrackerV2 demonstrates remarkable robustness under modality-missing scenarios.
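The abstract does not describe how Meta Merger or DMoE are implemented, so the following is only a generic illustration of the two ideas it names: merging whichever modalities are present into one shared vector, and passing that vector through a gated mixture of experts. The function names (`merge_modalities`, `moe_forward`), the mean-based merge, and the soft gating are all assumptions made for this sketch and should not be read as the authors' design.

```python
import math

def merge_modalities(embeddings):
    """Average the embeddings of the modalities that are present.

    `embeddings` maps a modality name to a feature vector, or to None when
    that modality is missing. Averaging is an assumed stand-in for a learned
    merger; it simply shows how a missing input can be tolerated.
    """
    present = [e for e in embeddings.values() if e is not None]
    if not present:
        raise ValueError("at least one modality is required")
    dim = len(present[0])
    return [sum(e[i] for e in present) / len(present) for i in range(dim)]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

def moe_forward(x, gate_weights, experts):
    """Soft mixture of experts (hypothetical, not the paper's T-MoE or M-MoE).

    Each expert is a callable mapping a vector to a vector of the same size.
    The gate scores each expert by a dot product with `x`, and the output is
    the probability-weighted sum of the expert outputs.
    """
    scores = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in gate_weights]
    probs = softmax(scores)
    outputs = [expert(x) for expert in experts]
    return [sum(p * out[i] for p, out in zip(probs, outputs)) for i in range(len(x))]
```

For instance, `merge_modalities({"rgb": [1.0, 3.0], "depth": None, "thermal": [3.0, 1.0]})` ignores the missing depth input and averages the remaining two vectors. A real system would replace both the averaging and the gate with learned components.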