Exploring Modality-Aware Fusion and Decoupled Temporal Propagation for Multi-Modal Object Tracking

📅 2026-03-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal object tracking methods employ a uniform fusion strategy that overlooks modality-specific characteristics and propagates temporal information through mixed tokens, leading to entangled representations and insufficient discriminability. To address these limitations, this work proposes MDTrack, a novel framework that introduces, for the first time, modality-dedicated experts and a decoupled dual-stream state space model. Specifically, separate expert modules are assigned to RGB and X modalities and dynamically selected based on input content, while independent state space models capture the distinct temporal dynamics of each modality. Cross-modal collaboration is achieved through cross-attention mechanisms. Extensive experiments demonstrate that MDTrack achieves state-of-the-art performance across five multimodal tracking benchmarks, significantly improving both tracking accuracy and robustness.
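The summary's modality-dedicated experts with content-based selection can be illustrated with a minimal numpy sketch of softmax-gated mixture-of-experts fusion. All names, shapes, and the linear form of the experts are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def moe_fuse(tokens, experts, gate_w):
    """Route tokens of one modality stream through a weighted mix of experts.

    tokens:  (n, d) input features (hypothetical shapes)
    experts: list of k (d, d) expert weight matrices
    gate_w:  (d, k) gating projection
    """
    gates = softmax(tokens @ gate_w)                        # (n, k) per-token expert weights
    outs = np.stack([tokens @ w for w in experts], axis=1)  # (n, k, d) each expert's output
    return (gates[..., None] * outs).sum(axis=1)            # (n, d) gated combination
```

In the paper's setting, separate expert pools would serve the RGB and X streams; here a single pool stands in to show the gating mechanism.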

📝 Abstract
Most existing multimodal trackers adopt uniform fusion strategies, overlooking the inherent differences between modalities. Moreover, they propagate temporal information through mixed tokens, leading to entangled and less discriminative temporal representations. To address these limitations, we propose MDTrack, a novel framework for modality-aware fusion and decoupled temporal propagation in multimodal object tracking. Specifically, for modality-aware fusion, we allocate dedicated experts to each modality, including infrared, event, depth, and RGB, to process their respective representations. The gating mechanism within the Mixture of Experts dynamically selects the optimal experts based on the input features, enabling adaptive and modality-specific fusion. For decoupled temporal propagation, we introduce two separate State Space Model (SSM) structures to independently store and update the hidden states of the RGB and X-modal streams, effectively capturing their distinct temporal information. To ensure synergy between the two temporal representations, we incorporate a set of cross-attention modules between the input features of the two SSMs, facilitating implicit information exchange. The resulting temporally enriched features are then integrated into the backbone through another set of cross-attention modules, enhancing MDTrack's ability to leverage temporal information. Extensive experiments demonstrate the effectiveness of our proposed method. Both MDTrack-S and MDTrack-U achieve state-of-the-art performance across five multimodal tracking benchmarks.
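The decoupled temporal propagation described above, with two independent hidden states and a cross-attention exchange between the streams' inputs, can be sketched as follows. This is a hedged toy version: `ssm_step` is a plain linear recurrence, single-head attention stands in for the paper's modules, and all shapes and the shared `A`, `B` matrices are assumptions for illustration.

```python
import numpy as np

def ssm_step(h, x, A, B):
    """One linear state-space update: h' = A h + B x."""
    return A @ h + B @ x

def cross_attn(q, kv, scale=None):
    """Single-head cross-attention from stream q onto stream kv."""
    scale = 1.0 / np.sqrt(q.shape[-1]) if scale is None else scale
    s = (q @ kv.T) * scale
    s = s - s.max(axis=-1, keepdims=True)   # stable softmax
    w = np.exp(s)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ kv

def dual_stream_update(h_rgb, h_x, f_rgb, f_x, A, B):
    """One frame of decoupled propagation for RGB and X streams.

    h_rgb, h_x: (d,) per-modality hidden states
    f_rgb, f_x: (n, d) per-modality input features for this frame
    """
    # Implicit information exchange between the streams' inputs.
    f_rgb_x = f_rgb + cross_attn(f_rgb, f_x)
    f_x_rgb = f_x + cross_attn(f_x, f_rgb)
    # Decoupled propagation: each modality keeps its own hidden state.
    h_rgb = ssm_step(h_rgb, f_rgb_x.mean(axis=0), A, B)
    h_x = ssm_step(h_x, f_x_rgb.mean(axis=0), A, B)
    return h_rgb, h_x
```

The point of the split is visible in the code: the RGB and X hidden states are never mixed into one token sequence; only the inputs talk to each other through cross-attention.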
Problem

Research questions and friction points this paper is trying to address.

multimodal object tracking
modality-aware fusion
temporal propagation
feature entanglement
discriminative representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

modality-aware fusion
decoupled temporal propagation
Mixture of Experts
State Space Model
cross-attention