🤖 AI Summary
Existing single-object tracking methods suffer from limitations in reference-video modality combinations (typically restricted to a single reference modality), fragmented model architectures, and poor generalization. To address these issues, this paper proposes the first unified multi-modal single-object tracking framework. It supports arbitrary combinations of three reference modalities (bounding box, natural language, or both) and four video modalities (RGB, RGB+Depth, RGB+Thermal, RGB+Event). The framework employs a shared Transformer backbone, cross-modal feature alignment, and dynamic prompt fusion to enable parameter-efficient sharing and joint cross-modal inference. Evaluated on 18 benchmarks, it achieves state-of-the-art performance: over 3.0% higher AUC on TNL2K across all three reference modalities, and over 2.0% higher main-metric scores than Un-Track on all three RGB+X tasks. These results demonstrate significantly enhanced model generality and deployment flexibility in real-world multi-sensor scenarios.
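The summary above describes one set of backbone parameters serving every modality combination via modality-specific prompts. A minimal illustrative sketch of that prompt-fusion idea follows; the names (`PROMPTS`, `fuse_prompts`) and the tiny two-dimensional vectors are hypothetical placeholders, not the paper's actual implementation:

```python
# Hypothetical sketch of prompt-based unification: a single shared backbone
# processes any video-modality combination by prepending small
# modality-specific "prompt" vectors to the input token sequence.
# In the real framework these prompts would be learnable embeddings.

PROMPTS = {                 # one prompt vector per auxiliary modality (illustrative values)
    "rgb":   [1.0, 0.0],
    "depth": [0.0, 1.0],
    "event": [0.5, 0.5],
}

def fuse_prompts(tokens, modalities):
    """Prepend the prompts of the active modalities to the token sequence,
    so the same backbone parameters handle every modality combination."""
    prompt_tokens = [PROMPTS[m] for m in modalities]
    return prompt_tokens + tokens

# Usage: an RGB+Depth sequence gets two prompt tokens before its content token.
seq = fuse_prompts([[0.2, 0.3]], ["rgb", "depth"])
```

The design point is that switching modality combinations changes only which prompts are prepended, never the backbone weights.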
📝 Abstract
Single object tracking aims to localize a target object specified by a reference modality (bounding box, natural language, or both) in a sequence of a particular video modality (RGB, RGB+Depth, RGB+Thermal, or RGB+Event). Different reference modalities enable various human-machine interactions, and different video modalities are demanded in complex scenarios to enhance tracking robustness. Existing trackers are designed for one or a few video modalities with one or a few reference modalities, which leads to separate model designs and limits practical applications. In practice, a unified tracker is needed to handle these varied requirements. To the best of our knowledge, there is still no tracker that can perform tracking with all of the above reference modalities across all of these video modalities simultaneously. Thus, in this paper, we present a unified tracker, UniSOT, that handles different combinations of three reference modalities and four video modalities with uniform parameters. Extensive experimental results on 18 visual tracking, vision-language tracking, and RGB+X tracking benchmarks demonstrate that UniSOT shows superior performance against modality-specific counterparts. Notably, UniSOT outperforms previous counterparts by over 3.0% AUC on TNL2K across all three reference modalities and outperforms Un-Track by over 2.0% in the main metric across all three RGB+X video modalities.