All in One: Exploring Unified Vision-Language Tracking with Multi-Modal Alignment

📅 2023-07-07
🏛️ ACM Multimedia
📈 Citations: 12
Influential: 1
🤖 AI Summary
Existing vision-language tracking methods adopt a decoupled feature extraction and post-fusion paradigm, resulting in weak semantic guidance and insufficient target perception—particularly under challenging conditions such as distractors with high visual similarity or extreme illumination. To address this, we propose All-in-One, an end-to-end unified Transformer architecture that jointly processes raw visual and linguistic inputs to generate language-injected visual tokens, thereby integrating feature extraction and cross-modal interaction into a single coherent process. Furthermore, we introduce a dual-level contrastive alignment module—operating both cross-modally and intra-modally—to strengthen semantic grounding and target discriminability. Extensive evaluations on five major benchmarks—including OTB99-L, TNL2K, and LaSOT—demonstrate consistent superiority over state-of-the-art methods, with significant improvements in tracking robustness and accuracy under complex scenarios.
📝 Abstract
The current mainstream vision-language (VL) tracking framework consists of three parts, i.e., a visual feature extractor, a language feature extractor, and a fusion model. To pursue better performance, a natural modus operandi for VL tracking is employing customized and heavier unimodal encoders and multi-modal fusion models. Albeit effective, existing VL trackers separate feature extraction and feature integration, resulting in extracted features that lack semantic guidance and have limited target-aware capability in complex scenarios, e.g., similar distractors and extreme illumination. In this work, inspired by the recent success of exploring foundation models with unified architecture for both natural language and computer vision tasks, we propose an All-in-One framework, which learns joint feature extraction and interaction by adopting a unified transformer backbone. Specifically, we mix raw vision and language signals to generate language-injected vision tokens, which we then concatenate before feeding into the unified backbone architecture. This approach achieves feature integration in a unified backbone, removing the need for carefully designed fusion modules and resulting in a more effective and efficient VL tracking framework. To further improve the learning efficiency, we introduce a multi-modal alignment module based on cross-modal and intra-modal contrastive objectives, providing more reasonable representations for the unified All-in-One transformer backbone. Extensive experiments on five benchmarks, i.e., OTB99-L, TNL2K, LaSOT, LaSOTExt and WebUAV-3M, demonstrate the superiority of the proposed tracker against existing state-of-the-art (SOTA) methods on VL tracking. Codes will be available at https://github.com/983632847/All-in-One.
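The abstract describes mixing raw vision and language signals into language-injected vision tokens and running the concatenated sequence through a single shared backbone. Below is a minimal PyTorch sketch of that pipeline shape; the class name, dimensions, and the simple additive injection scheme are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

```python
# A minimal sketch (not the authors' code): text tokens are embedded, injected
# into patch tokens, and the mixed sequence is processed by one shared Transformer
# backbone. All names, sizes, and the injection scheme (adding a pooled text
# embedding) are illustrative assumptions.
import torch
import torch.nn as nn


class UnifiedVLBackboneSketch(nn.Module):
    def __init__(self, dim=768, depth=12, heads=12, patch=16, vocab=30522):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.text_embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, depth)  # shared for both modalities

    def forward(self, image, token_ids):
        # Image -> patch tokens: (B, N, dim)
        vis = self.patch_embed(image).flatten(2).transpose(1, 2)
        # Language -> word tokens: (B, L, dim)
        txt = self.text_embed(token_ids)
        # "Language-injected" vision tokens: here, add the pooled text embedding
        # to every patch token (one simple choice among many possible mixers).
        vis = vis + txt.mean(dim=1, keepdim=True)
        # Concatenate and run the single unified backbone (joint extraction + interaction).
        tokens = torch.cat([vis, txt], dim=1)
        return self.backbone(tokens)


# Usage: a 224x224 search frame with an 8-word description.
model = UnifiedVLBackboneSketch()
out = model(torch.randn(2, 3, 224, 224), torch.randint(0, 30522, (2, 8)))
print(out.shape)  # (2, 196 + 8, 768)
```

Because fusion happens inside the backbone itself, no separate post-hoc fusion module is needed, which is the efficiency argument made in the abstract.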
Problem

Research questions and friction points this paper is trying to address.

Existing VL trackers decouple feature extraction from feature integration, so extracted features lack semantic guidance.
Decoupled features have limited target-aware capability in complex scenarios such as similar distractors and extreme illumination.
Chasing accuracy with customized, heavier unimodal encoders and fusion models adds cost without unifying the two modalities.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified Transformer backbone learns joint feature extraction and cross-modal interaction, removing hand-designed fusion modules.
Language-injected vision tokens mix raw vision and language signals to strengthen semantic guidance.
Dual-level (cross-modal and intra-modal) contrastive alignment improves learning efficiency and representation quality (a brief sketch follows this list).
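The dual-level alignment pairs a cross-modal objective with an intra-modal one. The sketch below shows how such InfoNCE-style contrastive losses are commonly computed; the function name, the pooling, the pairing of template and search-region embeddings, and the temperature value are illustrative assumptions rather than the paper's exact loss.

```python
# Illustrative InfoNCE-style contrastive losses for cross-modal (vision<->language)
# and intra-modal alignment. Names, pooling, and temperature are assumptions,
# not the paper's exact formulation.
import torch
import torch.nn.functional as F


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings a[i] <-> b[i]."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


# Pooled embeddings (hypothetical): tokens averaged per sample.
vis_emb = torch.randn(8, 768)   # pooled search-region tokens
txt_emb = torch.randn(8, 768)   # pooled language tokens
tmp_emb = torch.randn(8, 768)   # pooled template tokens (same modality as vis_emb)

cross_modal_loss = info_nce(vis_emb, txt_emb)   # align vision with its description
intra_modal_loss = info_nce(vis_emb, tmp_emb)   # align search region with template
total_alignment_loss = cross_modal_loss + intra_modal_loss
print(float(total_alignment_loss))
```

Adding such alignment terms on top of the tracking loss is a common way to pull matched vision-language (and within-modality) pairs together while pushing mismatched pairs apart, which is what the abstract credits for the more reasonable representations.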
👥 Authors
Chunhui Zhang
Cooperative Medianet Innovation Center, Shanghai Jiao Tong University, Shanghai, 200240, China and CloudWalk Technology Co., Ltd, 201203, China
Xin Sun
Cooperative Medianet Innovation Center, Shanghai Jiao Tong University, Shanghai, 200240, China and CloudWalk Technology Co., Ltd, 201203, China
Li Liu
Hong Kong University of Science and Technology (Guangzhou), Guangzhou, 511458, China
Yiqian Yang
HKUST
Qiong Liu
CloudWalk Technology Co., Ltd, 201203, China
Xiaoping Zhou
CloudWalk Technology Co., Ltd, 201203, China
Yanfeng Wang
Shanghai Jiao Tong University