MTNet: Learning modality-aware representation with transformer for RGBT tracking

📅 2025-08-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

RGBT tracking faces three key challenges: multimodal representations that are not robust enough, limited cross-modal interaction under conventional fusion paradigms, and reduced temporal adaptability caused by a fixed template. To address these, we propose MTNet, a modality-aware, Transformer-based tracker. Specifically, we design a channel aggregation and distribution module (CADM) and a spatial similarity perception module (SSPM) for fine-grained, modality-specific representation learning; construct a lightweight Transformer fusion network that explicitly models long-range cross-modal dependencies; and introduce a trident (three-branch) prediction head with a dynamic template update strategy to improve instance discriminability and temporal consistency. The method achieves results competitive with state-of-the-art trackers on three standard benchmarks (RGBT210, RGBT234, and GTOT) while maintaining real-time inference speed (>30 FPS).

📝 Abstract

Learning robust multi-modality representations has played a critical role in the development of RGBT tracking. However, the conventional fusion paradigm and a fixed tracking template still restrict feature interaction. In this paper, we propose a modality-aware, transformer-based tracker, termed MTNet. Specifically, a modality-aware network is presented to explore modality-specific cues; it contains a channel aggregation and distribution module (CADM) and a spatial similarity perception module (SSPM). A transformer fusion network is then applied to capture global dependencies and reinforce instance representations. To estimate the target location precisely and to handle challenges such as scale variation and deformation, we design a trident prediction head and a dynamic update strategy, which jointly maintain a reliable template to facilitate inter-frame communication. Extensive experiments show that the proposed method achieves satisfactory results compared with state-of-the-art competitors on three RGBT benchmarks while running at real-time speed.
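The abstract does not spell out how the transformer fusion network captures global dependencies. As a rough illustration only, the sketch below shows plain scaled dot-product cross-attention, the generic building block of transformer fusion, in which every RGB token attends to every thermal token. The token values, the function names, and the absence of learned projections are all assumptions made for this example; this is not MTNet's actual architecture.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query token attends to all key tokens.

    Hypothetical helper for illustration; no learned projections are applied.
    """
    d = len(queries[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        # Weighted sum of the value vectors, channel by channel.
        out = [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        weights.append(w)
        outputs.append(out)
    return outputs, weights


# Assumed toy tokens: 2 RGB tokens and 3 thermal (TIR) tokens, 2 channels each.
rgb_tokens = [[1.0, 0.0], [0.0, 1.0]]
tir_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# RGB tokens act as queries; thermal tokens supply keys and values.
fused, attn = cross_attention(rgb_tokens, tir_tokens, tir_tokens)
```

Because every query is compared with every key, each fused RGB token can draw on thermal evidence from any spatial position, which is what "global dependencies" usually refers to in this setting.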

Problem

Research questions and friction points this paper is trying to address.

Learning robust multi-modality representation for RGBT tracking

Overcoming limitations in feature interaction and fusion paradigms

Addressing scale variation and deformation challenges in tracking

Innovation

Methods, ideas, or system contributions that make the work stand out.

Modality-aware network with CADM and SSPM

Transformer fusion for global dependencies

Trident prediction head and dynamic template update
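The page does not describe the rule behind the dynamic update. One common pattern in template-based trackers, shown here purely as a hedged sketch, is to keep the first-frame template fixed and to refresh a second, dynamic template only when the prediction is confident and enough frames have passed. The class name `TemplateBank`, the confidence threshold, and the update interval are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class TemplateBank:
    """Hypothetical container for a fixed and a dynamic tracking template."""

    initial: str            # template from the first frame, never replaced
    dynamic: str            # most recent template judged reliable
    threshold: float = 0.7  # assumed minimum confidence for an update
    interval: int = 10      # assumed minimum number of frames between updates
    last_update: int = 0    # frame index of the last accepted update

    def maybe_update(self, frame_idx, candidate, confidence):
        """Replace the dynamic template if the update is due and confident."""
        due = frame_idx - self.last_update >= self.interval
        if due and confidence >= self.threshold:
            self.dynamic = candidate
            self.last_update = frame_idx
            return True
        return False


# Usage with placeholder strings standing in for template image crops.
bank = TemplateBank(initial="crop_0", dynamic="crop_0")
too_early = bank.maybe_update(5, "crop_5", 0.9)      # interval not reached
low_conf = bank.maybe_update(10, "crop_10", 0.5)     # confidence too low
accepted = bank.maybe_update(12, "crop_12", 0.9)     # due and confident
```

Gating on confidence is one way to avoid contaminating the template with occluded or drifting predictions, while the untouched initial template acts as an anchor if the dynamic one degrades.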
