Adaptive Perception for Unified Visual Multi-modal Object Tracking

📅 2025-02-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multi-modal tracking methods rely heavily on the RGB modality, creating a perceptual imbalance across modalities that hinders dynamic fusion of complementary information in complex scenarios and limits the generalization of unified models. To address this, the authors propose APTrack, a unified tracker built on an equal-modality modeling paradigm. APTrack introduces an Adaptive Modality Interaction (AMI) module that dynamically bridges modality features through learnable cross-modal tokens, and its shared architecture lets a single set of parameters handle RGB-T, RGB-D, and RGB-Event tracking without per-task fine-tuning. Evaluated on five major benchmarks (RGBT234, LasHeR, VisEvent, DepthTrack, and VOT-RGBD2022), APTrack consistently outperforms both state-of-the-art unified trackers and modality-specific trackers.
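To make the token-bridging idea concrete, here is a minimal PyTorch sketch of how learnable cross-modal tokens could mediate interaction between two modality streams. The class name, dimensions, and the gather/scatter attention pattern are illustrative assumptions, not the paper's actual AMI implementation.

```python
# Hypothetical sketch of an Adaptive Modality Interaction (AMI) block.
# All names, dimensions, and design details are assumptions for illustration;
# the paper's actual implementation may differ substantially.
import torch
import torch.nn as nn

class AMIBlock(nn.Module):
    """Bridges two modality streams via a small set of learnable tokens."""

    def __init__(self, dim: int = 256, num_tokens: int = 4, num_heads: int = 8):
        super().__init__()
        # Learnable cross-modal tokens shared by both modality streams.
        self.bridge_tokens = nn.Parameter(torch.zeros(1, num_tokens, dim))
        nn.init.trunc_normal_(self.bridge_tokens, std=0.02)
        self.gather = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.scatter = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor):
        # x_a, x_b: (B, N, dim) token sequences from two modalities
        # (e.g. RGB and thermal/depth/event), treated symmetrically.
        b = x_a.size(0)
        tokens = self.bridge_tokens.expand(b, -1, -1)
        # 1) Gather: bridge tokens attend to the concatenated streams.
        ctx = torch.cat([x_a, x_b], dim=1)
        tokens, _ = self.gather(tokens, ctx, ctx)
        tokens = self.norm(tokens)
        # 2) Scatter: each stream attends back to the fused bridge tokens.
        x_a = x_a + self.scatter(x_a, tokens, tokens)[0]
        x_b = x_b + self.scatter(x_b, tokens, tokens)[0]
        return x_a, x_b
```

Because the bridge tokens are few, this keeps cross-modal attention cheap while letting either stream borrow information from the other, matching the symmetric treatment the summary describes.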

📝 Abstract
Recently, many multi-modal trackers prioritize RGB as the dominant modality, treating other modalities as auxiliary and fine-tuning separate models for different multi-modal tasks. This imbalance in modality dependence limits a method's ability to dynamically exploit complementary information from each modality in complex scenarios, making it difficult to fully realize the advantages of multi-modal data. As a result, a single unified-parameter model often underperforms across multi-modal tracking tasks. To address this issue, we propose APTrack, a novel unified tracker designed for multi-modal adaptive perception. Unlike previous methods, APTrack explores a unified representation through an equal modeling strategy, which allows the model to adapt dynamically to various modalities and tasks without requiring additional fine-tuning between tasks. Moreover, our tracker integrates an adaptive modality interaction (AMI) module that efficiently bridges cross-modality interactions by generating learnable tokens. Experiments on five diverse multi-modal datasets (RGBT234, LasHeR, VisEvent, DepthTrack, and VOT-RGBD2022) demonstrate that APTrack not only surpasses existing state-of-the-art unified multi-modal trackers but also outperforms trackers designed for specific multi-modal tasks.
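As a rough illustration of the equal modeling strategy, the sketch below runs both modality streams through a single weight-shared Transformer trunk, so neither stream is privileged as dominant. The class name, token counts, and dimensions are assumptions made for illustration, not the authors' architecture.

```python
# Hypothetical sketch of "equal modeling": one shared encoder processes both
# modality streams with identical weights. Names and shapes are illustrative
# assumptions, not the authors' code.
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """One weight-shared Transformer trunk applied to both modality streams."""

    def __init__(self, dim: int = 256, depth: int = 4):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
             for _ in range(depth)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            x = block(x)
        return x

enc = SharedEncoder()
rgb = torch.randn(2, 196, 256)   # RGB search-region tokens
aux = torch.randn(2, 196, 256)   # thermal / depth / event tokens
# Equal treatment: the same weights process both streams, which is what lets
# one checkpoint serve RGB-T, RGB-D, or RGB-Event without per-task tuning.
f_rgb, f_aux = enc(rgb), enc(aux)
```

In a full tracker, cross-modal interaction blocks (such as the AMI sketch above) would be interleaved between these shared layers so the two streams can exchange complementary cues as they are encoded.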
Problem

Research questions and friction points this paper is trying to address.

Unified multi-modal object tracking
Dynamic adaptation to modalities
Cross-modality interaction enhancement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified multi-modal adaptive perception
Unified representation via an equal modeling strategy
Adaptive modality interaction module
👥 Authors
Xiantao Hu
Nanjing University of Science & Technology (Computer Vision)
Bineng Zhong
Guangxi Key Lab of Multi-Source Information Mining & Security, Guangxi Normal University, Guilin 541004, China
Qihua Liang
Guangxi Key Lab of Multi-Source Information Mining & Security, Guangxi Normal University, Guilin 541004, China
Zhiyi Mo
School of Data Science and Software Engineering, Wuzhou University, Wuzhou 543002, China
Liangtao Shi
Guangxi Key Lab of Multi-Source Information Mining & Security, Guangxi Normal University, Guilin 541004, China
Ying Tai
School of Intelligence Science and Technology, Nanjing University, Nanjing 210008, China
Jian Yang
PCA-Lab, School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China