Breaking Shallow Limits: Task-Driven Pixel Fusion for Gap-free RGBT Tracking

📅 2025-03-14
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the substantial modality distribution discrepancy and weak discriminability of shallow pixel-level fusion in RGB-T tracking, this paper proposes a task-driven pixel-level fusion network (TPF). Methodologically: (1) a lightweight Pixel-level Fusion Adapter (PFA) is designed to achieve task-oriented modality alignment at shallow layers; (2) a progressive learning framework is introduced, integrating adaptive multi-expert distillation for initialization and decoupled representation learning to enhance the discriminability of fused features; (3) a neighbor-aware dynamic template update mechanism is incorporated to improve robustness against appearance variations. Extensive experiments demonstrate that TPF achieves significant improvements over state-of-the-art methods on four mainstream RGB-T benchmark datasets, while maintaining real-time, low-latency performance. The source code will be made publicly available.

📝 Abstract
Current RGBT tracking methods often overlook the impact of fusion location on mitigating the modality gap, which is a key factor in effective tracking. Our analysis reveals that shallower fusion yields a smaller distribution gap. However, the limited discriminative power of shallow networks makes it hard to distinguish task-relevant information from noise, limiting the potential of pixel-level fusion. To break these shallow limits, we propose a novel Task-driven Pixel-level Fusion network, named TPF, which unveils the power of pixel-level fusion in RGBT tracking through a progressive learning framework. In particular, we design a lightweight Pixel-level Fusion Adapter (PFA) that exploits Mamba's linear complexity to ensure real-time, low-latency RGBT tracking. To enhance the fusion capabilities of the PFA, our task-driven progressive learning framework first uses adaptive multi-expert distillation to inherit fusion knowledge from state-of-the-art image fusion models, establishing a robust initialization, and then employs a decoupled representation learning scheme to achieve task-relevant information fusion. Moreover, to overcome appearance variations between the initial template and search frames, we present a nearest-neighbor dynamic template updating scheme, which selects the most reliable frame closest to the current search frame as the dynamic template. Extensive experiments demonstrate that TPF significantly outperforms most existing state-of-the-art trackers on four public RGBT tracking datasets. The code will be released upon acceptance.
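To make the idea of pixel-level fusion concrete: fusing the two modalities before any deep layers means combining the raw RGB and thermal images per pixel. The sketch below is a minimal illustrative stand-in, not the paper's Mamba-based PFA; the function name, the per-pixel gating map, and the convex-combination rule are all assumptions for illustration.

```python
import numpy as np

def pixel_level_fusion(rgb, thermal, w_gate):
    """Fuse RGB and thermal inputs at the pixel level with a per-pixel gate.

    rgb, thermal: (H, W, C) arrays on a common scale.
    w_gate: (H, W, 1) gating map in [0, 1].
    A per-pixel convex combination is an illustrative simplification of
    the paper's learned Pixel-level Fusion Adapter.
    """
    gate = np.clip(w_gate, 0.0, 1.0)       # keep the gate in [0, 1]
    # gate = 1 keeps the RGB pixel, gate = 0 keeps the thermal pixel
    return gate * rgb + (1.0 - gate) * thermal
```

In the actual method, the gating (and richer mixing) would be produced by a learned, task-driven module rather than supplied by hand; the point here is only that fusion happens on pixels, where the distribution gap between modalities is smallest.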
Problem

Research questions and friction points this paper is trying to address.

Addresses the modality gap in RGBT tracking by optimizing the fusion location.
Proposes a task-driven pixel-level fusion network for enhanced tracking performance.
Introduces a dynamic template updating scheme to handle appearance variations.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Task-driven Pixel-level Fusion network (TPF)
Lightweight Pixel-level Fusion Adapter (PFA)
Nearest-neighbor dynamic template updating scheme
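The nearest-neighbor dynamic template update can be sketched as follows: keep a buffer of past frames with their tracking confidences, and pick the reliable one most similar to the current search frame. This is a hedged sketch of that selection step only; the function name, the confidence threshold, and the cosine-similarity criterion are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def select_dynamic_template(search_feat, candidates, conf_threshold=0.6):
    """Pick the reliable past frame most similar to the current search frame.

    search_feat: 1-D feature vector of the current search frame.
    candidates: list of (feature_vector, tracking_confidence) tuples from
    past frames. Returns (index, similarity) of the best candidate, or
    (None, -1.0) if no candidate is confident enough.
    """
    best, best_sim = None, -1.0
    s = search_feat / (np.linalg.norm(search_feat) + 1e-8)
    for idx, (feat, conf) in enumerate(candidates):
        if conf < conf_threshold:          # skip unreliable frames
            continue
        f = feat / (np.linalg.norm(feat) + 1e-8)
        sim = float(np.dot(s, f))          # cosine similarity to search frame
        if sim > best_sim:
            best, best_sim = idx, sim
    return best, best_sim
```

The selected frame would then serve as the dynamic template alongside the initial one, so the tracker always compares against an appearance close to what it currently sees.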
Authors
Andong Lu, Anhui University
Yuanzhi Guo, School of Artificial Intelligence, Anhui University
Wanyu Wang, School of Artificial Intelligence, Anhui University
Chenglong Li, Professor, The University of Florida
Jin Tang, Anhui University
Bin Luo, School of Computer Science and Technology, Anhui University