🤖 AI Summary
To address the limitations of layer-wise cross-modal fusion in RGB-T tracking, namely insufficient robustness and high computational cost, this paper proposes AINet, an all-layer multimodal interaction framework. Methodologically, it introduces three key components: (1) a Difference-based Fusion Mamba (DFM) that exploits the fine-grained discrepancies between RGB and thermal infrared features to fuse the two modalities with linear complexity; (2) an Order-dynamic Fusion Mamba (OFM) that dynamically adjusts the scan order of different layers to make interaction across all layers tractable; and (3) a progressive fusion strategy that orchestrates efficient and robust feature interaction from shallow to deep layers. Extensive experiments on four mainstream RGB-T benchmark datasets demonstrate consistent improvements over state-of-the-art methods, validating the critical role of all-layer cross-modal interaction in enhancing tracking robustness under challenging scenarios.
📝 Abstract
Existing RGBT tracking methods often design various interaction models to perform cross-modal fusion at each layer, but, due to the large computational burden, they cannot execute feature interactions among all layers, which play a critical role in robust multimodal representation. To address this issue, this paper presents a novel All-layer multimodal Interaction Network, named AINet, which performs efficient and effective feature interactions across all modalities and layers in a progressive fusion Mamba, for robust RGBT tracking. Although modality features at different layers are known to contain different cues, building multimodal interactions at every layer is challenging because interaction capability and efficiency are hard to balance. Meanwhile, considering that the feature discrepancy between the RGB and thermal modalities reflects their complementary information to some extent, we design a Difference-based Fusion Mamba (DFM) to achieve enhanced fusion of the two modalities with linear complexity. When features from all layers interact, a huge token sequence (3840 tokens in this work) is involved, and the computational burden is thus large. To handle this problem, we design an Order-dynamic Fusion Mamba (OFM) that executes efficient and effective feature interactions across all layers by dynamically adjusting the scan order of the different layers in Mamba. Extensive experiments on four public RGBT tracking datasets show that AINet achieves leading performance compared with existing state-of-the-art methods.
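The abstract does not give implementation details, but the two core ideas can be illustrated schematically. Below is a minimal NumPy sketch under stated assumptions: a toy exponential-decay recurrence stands in for the Mamba selective scan (both are O(L) in sequence length), a sigmoid gate over the scanned RGB/TIR difference stands in for DFM, and reordering the per-layer sequences by mean feature magnitude stands in for OFM's dynamic scan order. All function names and the ordering heuristic are illustrative, not the authors' actual method.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear_scan(tokens, decay=0.9):
    """Toy linear-complexity recurrence (stand-in for a Mamba scan):
    h_t = decay * h_{t-1} + x_t, returned for every token. O(L) in length."""
    h = np.zeros(tokens.shape[-1])
    out = np.empty_like(tokens)
    for t, x in enumerate(tokens):
        h = decay * h + x
        out[t] = h
    return out

def difference_based_fusion(rgb, tir):
    """DFM idea (sketch): the RGB/TIR feature difference is treated as a cue
    for complementary information and gates the fusion of the two modalities."""
    diff = rgb - tir
    gate = 1.0 / (1.0 + np.exp(-linear_scan(diff)))  # sigmoid of scanned diff
    return gate * rgb + (1.0 - gate) * tir

def order_dynamic_scan(layer_tokens):
    """OFM idea (sketch): dynamically reorder the per-layer token sequences
    (here: by mean feature magnitude, an illustrative heuristic) before one
    joint linear-time scan, so all layers interact in a single pass."""
    order = np.argsort([np.abs(t).mean() for t in layer_tokens])
    joint = np.concatenate([layer_tokens[i] for i in order], axis=0)
    return linear_scan(joint), order

# Four layers of fused tokens, e.g. 16 tokens x 8 channels per layer.
layers = [difference_based_fusion(rng.normal(size=(16, 8)),
                                  rng.normal(size=(16, 8)))
          for _ in range(4)]
fused, order = order_dynamic_scan(layers)
print(fused.shape)  # (64, 8): all layers interact in one linear-time scan
```

The point of the sketch is the complexity argument: concatenating all layers into one sequence and scanning it once keeps the all-layer interaction linear in the total token count (3840 in the paper), whereas pairwise attention across layers would be quadratic.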