What You Have is What You Track: Adaptive and Robust Multimodal Tracking

📅 2025-07-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
In visual tracking, multimodal sensors often suffer from synchronization failures, leading to temporal data incompleteness and severely degrading tracker performance—existing methods, with fixed architectures, struggle to adapt to dynamic modality dropouts. This work is the first systematic study on robust tracking under temporally incomplete multimodal inputs. We propose an adaptive multimodal tracking framework featuring a heterogeneous Mixture-of-Experts (MoE) fusion mechanism, integrating dynamic computation routing and video-level masking to jointly preserve temporal continuity and spatial feature integrity. The model automatically activates specialized computational units in real time based on modality dropout rates and scene complexity. Evaluated on nine benchmarks, our method consistently outperforms state-of-the-art approaches across full-modality and diverse partial-modality settings—including varying dropout patterns and ratios—demonstrating superior adaptability, robustness, and generalization.

📝 Abstract
Multimodal data is known to be helpful for visual tracking by improving robustness to appearance variations. However, sensor synchronization challenges often compromise data availability, particularly in video settings where shortages can be temporal. Despite its importance, this area remains underexplored. In this paper, we present the first comprehensive study on tracker performance with temporally incomplete multimodal data. Unsurprisingly, under such a circumstance, existing trackers exhibit significant performance degradation, as their rigid architectures lack the adaptability needed to effectively handle missing modalities. To address these limitations, we propose a flexible framework for robust multimodal tracking. We venture that a tracker should dynamically activate computational units based on missing-data rates. This is achieved through a novel Heterogeneous Mixture-of-Experts fusion mechanism with adaptive complexity, coupled with a video-level masking strategy that ensures both temporal consistency and spatial completeness, which is critical for effective video tracking. Surprisingly, our model not only adapts to varying missing rates but also adjusts to scene complexity. Extensive experiments show that our model achieves SOTA performance across 9 benchmarks, excelling in both conventional complete and missing modality settings. The code and benchmark will be publicly available at https://github.com/supertyd/FlexTrack/tree/main.
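The abstract's core idea is that routing should depend on the missing-data rate: as a modality drops out more often, the tracker should shift compute toward cheaper experts. The paper does not publish this routine here, so the following is a minimal, hypothetical sketch of cost-aware gating over heterogeneous experts; the function names, the linear cost penalty, and the top-k selection are all illustrative assumptions, not the authors' implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of gate logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def select_experts(gate_logits, expert_costs, missing_rate, top_k=2, penalty=2.0):
    """Hypothetical cost-aware routing: bias the gate away from expensive
    experts as the modality missing rate grows, so less compute is spent
    when one modality is largely absent.

    gate_logits  -- learned per-expert scores for the current input
    expert_costs -- relative compute cost of each (heterogeneous) expert
    missing_rate -- fraction of frames with a missing modality, in [0, 1]
    """
    # Penalize each expert's logit in proportion to its cost and the missing rate.
    adjusted = [g - penalty * missing_rate * c
                for g, c in zip(gate_logits, expert_costs)]
    probs = softmax(adjusted)
    # Activate only the top-k experts (sparse Mixture-of-Experts routing).
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]
    return chosen, [probs[i] for i in chosen]
```

With full data (`missing_rate=0`) the gate follows the raw logits; as `missing_rate` approaches 1, the cost penalty dominates and routing collapses onto the cheapest experts, which is one plausible reading of "adaptive complexity" in the abstract.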
Problem

Research questions and friction points this paper is trying to address.

Study tracker performance under temporally incomplete multimodal data
Propose a flexible framework for robust multimodal tracking
Achieve SOTA performance in both complete- and missing-modality settings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Heterogeneous Mixture-of-Experts fusion mechanism
Adaptive expert complexity matched to missing-data rates
Video-level masking for temporal consistency
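The video-level masking idea above contrasts with per-frame or per-patch masking: to mimic real synchronization failures, a modality is dropped for whole frames across a clip, keeping each remaining frame spatially complete. The sketch below is a hypothetical training-time version of that strategy; the contiguous-run dropout pattern and function signature are assumptions for illustration, not the paper's exact procedure.

```python
import random

def video_level_mask(num_frames, missing_rate, seed=None):
    """Return a per-frame availability mask for the auxiliary modality.

    Drops the modality for a contiguous run of frames (video-level),
    rather than masking random patches inside frames, so temporal
    structure of the dropout is realistic and surviving frames stay
    spatially complete.
    """
    rng = random.Random(seed)
    n_drop = round(num_frames * missing_rate)
    if n_drop == 0:
        return [True] * num_frames  # modality fully available
    # Pick where the synchronization failure starts.
    start = rng.randrange(0, num_frames - n_drop + 1)
    return [not (start <= i < start + n_drop) for i in range(num_frames)]
```

During training, frames where the mask is `False` would have their auxiliary-modality tokens withheld, forcing the fusion module to cope with the same temporal incompleteness it will see at test time.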