Adaptive Perception for Unified Visual Multi-modal Object Tracking

📅 2025-02-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multi-modal tracking methods rely heavily on the RGB modality, creating a perceptual imbalance across modalities that hinders dynamic fusion of complementary information in complex scenarios and limits the generalization of unified models. To address this, the authors propose APTrack, a unified tracker built on an equal-modality modeling paradigm. APTrack introduces an Adaptive Modality Interaction (AMI) module that dynamically bridges modality features through learnable cross-modal tokens, and its shared architecture lets a single set of parameters handle RGB-T, RGB-D, and RGB-Event tracking without per-task fine-tuning. Evaluated on five major benchmarks (RGBT234, LasHeR, VisEvent, DepthTrack, and VOT-RGBD2022), APTrack consistently outperforms both state-of-the-art unified trackers and modality-specific trackers.
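To make the token-bridging idea concrete, here is a minimal PyTorch sketch of how learnable cross-modal tokens could mediate interaction between two modality streams. The class name, dimensions, and the gather/scatter attention pattern are illustrative assumptions, not the paper's actual AMI implementation.

```python
# Hypothetical sketch of an Adaptive Modality Interaction (AMI) block.
# All names, dimensions, and design details are assumptions for illustration;
# the paper's actual implementation may differ substantially.
import torch
import torch.nn as nn

class AMIBlock(nn.Module):
    """Bridges two modality streams via a small set of learnable tokens."""

    def __init__(self, dim: int = 256, num_tokens: int = 4, num_heads: int = 8):
        super().__init__()
        # Learnable cross-modal tokens shared by both modality streams.
        self.bridge_tokens = nn.Parameter(torch.zeros(1, num_tokens, dim))
        nn.init.trunc_normal_(self.bridge_tokens, std=0.02)
        self.gather = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.scatter = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor):
        # x_a, x_b: (B, N, dim) token sequences from two modalities
        # (e.g. RGB and thermal/depth/event), treated symmetrically.
        b = x_a.size(0)
        tokens = self.bridge_tokens.expand(b, -1, -1)
        # 1) Gather: bridge tokens attend to the concatenated streams.
        ctx = torch.cat([x_a, x_b], dim=1)
        tokens, _ = self.gather(tokens, ctx, ctx)
        tokens = self.norm(tokens)
        # 2) Scatter: each stream attends back to the fused bridge tokens.
        x_a = x_a + self.scatter(x_a, tokens, tokens)[0]
        x_b = x_b + self.scatter(x_b, tokens, tokens)[0]
        return x_a, x_b
```

Because the bridge tokens are few, this keeps cross-modal attention cheap while letting either stream borrow information from the other, matching the symmetric treatment the summary describes.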

📝 Abstract
Recently, many multi-modal trackers prioritize RGB as the dominant modality, treating other modalities as auxiliary and fine-tuning separate models for different multi-modal tasks. This imbalance in modality dependence limits a method's ability to dynamically exploit complementary information from each modality in complex scenarios, making it difficult to fully realize the advantages of multi-modal data. As a result, a single unified-parameter model often underperforms across multi-modal tracking tasks. To address this issue, we propose APTrack, a novel unified tracker designed for multi-modal adaptive perception. Unlike previous methods, APTrack explores a unified representation through an equal modeling strategy, which allows the model to adapt dynamically to various modalities and tasks without requiring additional fine-tuning between tasks. Moreover, our tracker integrates an adaptive modality interaction (AMI) module that efficiently bridges cross-modality interactions by generating learnable tokens. Experiments on five diverse multi-modal datasets (RGBT234, LasHeR, VisEvent, DepthTrack, and VOT-RGBD2022) demonstrate that APTrack not only surpasses existing state-of-the-art unified multi-modal trackers but also outperforms trackers designed for specific multi-modal tasks.
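As a rough illustration of the equal modeling strategy, the sketch below runs both modality streams through a single weight-shared Transformer trunk, so neither stream is privileged as dominant. The class name, token counts, and dimensions are assumptions made for illustration, not the authors' architecture.

```python
# Hypothetical sketch of "equal modeling": one shared encoder processes both
# modality streams with identical weights. Names and shapes are illustrative
# assumptions, not the authors' code.
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """One weight-shared Transformer trunk applied to both modality streams."""

    def __init__(self, dim: int = 256, depth: int = 4):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
             for _ in range(depth)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            x = block(x)
        return x

enc = SharedEncoder()
rgb = torch.randn(2, 196, 256)   # RGB search-region tokens
aux = torch.randn(2, 196, 256)   # thermal / depth / event tokens
# Equal treatment: the same weights process both streams, which is what lets
# one checkpoint serve RGB-T, RGB-D, or RGB-Event without per-task tuning.
f_rgb, f_aux = enc(rgb), enc(aux)
```

In a full tracker, cross-modal interaction blocks (such as the AMI sketch above) would be interleaved between these shared layers so the two streams can exchange complementary cues as they are encoded.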
Problem

Research questions and friction points this paper is trying to address.

Unified multi-modal object tracking
Dynamic adaptation to modalities
Cross-modality interaction enhancement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified multi-modal adaptive perception
Unified representation via an equal modeling strategy
Adaptive modality interaction module
👥 Authors
Xiantao Hu
Nanjing University of Science & Technology (Computer Vision)
Bineng Zhong
Guangxi Key Lab of Multi-Source Information Mining & Security, Guangxi Normal University, Guilin 541004, China
Qihua Liang
Guangxi Key Lab of Multi-Source Information Mining & Security, Guangxi Normal University, Guilin 541004, China
Zhiyi Mo
School of Data Science and Software Engineering, Wuzhou University, Wuzhou 543002, China
Liangtao Shi
Guangxi Key Lab of Multi-Source Information Mining & Security, Guangxi Normal University, Guilin 541004, China
Ying Tai
School of Intelligence Science and Technology, Nanjing University, Nanjing 210008, China
Jian Yang
PCA-Lab, School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China