ATSTrack: Enhancing Visual-Language Tracking by Aligning Temporal and Spatial Scales

📅 2025-07-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In vision-language tracking (VLT), inherent spatiotemporal misalignment between target motion-induced visual inputs and static linguistic descriptions impairs cross-modal feature alignment. This work is the first to systematically identify and address this scale mismatch problem. We propose a fine-grained spatiotemporal alignment framework comprising: (1) phrase-level linguistic feature decomposition to align with local visual regions; (2) a cross-modal spatiotemporal alignment module that explicitly models the discrepancy between visual frame rate and linguistic temporal granularity; and (3) vision-language guided tokens infused with historical linguistic context to enforce inter-frame semantic consistency. Built upon a Transformer architecture, our method jointly optimizes multimodal representations. Extensive experiments on mainstream benchmarks demonstrate state-of-the-art performance, validating that explicit spatiotemporal scale alignment is critical for enhancing VLT robustness.

📝 Abstract
A main challenge of Visual-Language Tracking (VLT) is the misalignment between visual inputs and language descriptions caused by target movement. Previous trackers have explored many effective feature modification methods to preserve more aligned features. However, an important yet unexplored factor ultimately hinders their capability, which is the inherent differences in the temporal and spatial scale of information between visual and language inputs. To address this issue, we propose a novel visual-language tracker that enhances the effect of feature modification by Aligning Temporal and Spatial scales of different input components, named as ATSTrack. Specifically, we decompose each language description into phrases with different attributes based on their temporal and spatial correspondence with visual inputs, and modify their features in a fine-grained manner. Moreover, we introduce a Visual-Language token that comprises modified linguistic information from the previous frame to guide the model to extract visual features that are more relevant to language description, thereby reducing the impact caused by the differences in spatial scale. Experimental results show that our proposed ATSTrack achieves performance comparable to existing methods. Our code will be released.
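The abstract's first step, decomposing a description into attribute phrases, can be sketched with a toy keyword-bucketing pass. This is illustrative only: the attribute names and word lists below are assumptions, and ATSTrack's actual linguistic decomposition is not specified on this page.

```python
import re

# Hypothetical attribute lexicons -- NOT from the paper; chosen to illustrate
# splitting a description into phrases with different attributes.
ATTRIBUTE_WORDS = {
    "appearance": {"red", "white", "small", "large", "striped"},
    "motion": {"running", "walking", "turning", "jumping"},
    "location": {"left", "right", "center", "ground", "road"},
}

def decompose(description):
    """Bucket description words by attribute via simple keyword matching."""
    tokens = re.findall(r"[a-z]+", description.lower())
    buckets = {attr: [] for attr in ATTRIBUTE_WORDS}
    for tok in tokens:
        for attr, vocab in ATTRIBUTE_WORDS.items():
            if tok in vocab:
                buckets[attr].append(tok)
    return buckets

phrases = decompose("a small white dog running on the left side of the road")
# e.g. phrases["motion"] == ["running"]
```

Each bucket could then be encoded and aligned with visual features at its own temporal and spatial granularity, per the paper's fine-grained modification scheme.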
Problem

Research questions and friction points this paper is trying to address.

Misalignment between visual inputs and language descriptions in tracking
Inherent temporal and spatial scale differences in visual-language inputs
Need for fine-grained feature modification to enhance tracking accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Aligns temporal and spatial scales
Decomposes language into attribute phrases
Uses Visual-Language token guidance
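The Visual-Language token guidance listed above can be sketched as a similarity-based reweighting of visual tokens against a token carrying the previous frame's linguistic feature. This is a minimal analogue under assumed shapes and a cosine-style similarity; the real ATSTrack injects such a token into a Transformer rather than applying an explicit reweighting.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def vl_guided_reweight(visual_tokens, prev_lang_feat):
    """Reweight visual tokens by similarity to a Visual-Language token built
    from the previous frame's linguistic feature (toy sketch, not the paper's
    exact mechanism)."""
    vl_token = prev_lang_feat / (np.linalg.norm(prev_lang_feat) + 1e-8)
    sims = visual_tokens @ vl_token            # (N,) similarity per token
    weights = softmax(sims)                    # attention-style weights
    return weights[:, None] * visual_tokens, weights

visual = rng.standard_normal((16, 32))   # 16 visual tokens, feature dim 32
lang = rng.standard_normal(32)           # previous-frame linguistic feature
guided, weights = vl_guided_reweight(visual, lang)
```

Tokens most similar to the linguistic feature receive the largest weights, which mirrors the stated goal of extracting visual features more relevant to the language description.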
Yihao Zhen
Shenyang Institute of Automation, CAS
Qiang Wang
School of Information Engineering, Shenyang University
Yu Qiao
School of Software, Shandong University
Liangqiong Qu
The University of Hong Kong
Medical Image Analysis · Image Synthesis · Illumination Modeling · Federated Learning
Huijie Fan
Shenyang Institute of Automation, Chinese Academy of Sciences