GLAD: Generative Language-Assisted Visual Tracking for Low-Semantic Templates

📅 2026-01-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the degradation in cross-modal alignment caused by low-semantic template images—such as blurry or low-resolution frames—in vision-language tracking, which adversely impacts target localization accuracy. To mitigate this issue, the paper introduces diffusion models into the task for the first time, proposing a generative language-assisted visual tracking approach. By fusing textual descriptions with template images through a diffusion process in latent space, the method enhances semantic compatibility between modalities, effectively alleviating modality misalignment under low-semantic conditions. Built upon a Transformer architecture, the framework enables efficient multimodal generative modeling. It achieves state-of-the-art performance across multiple benchmarks, significantly improving tracking accuracy with degraded or semantically weak templates while maintaining high inference efficiency.
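The latent-space generative fusion described above can be sketched, very loosely, as a text-conditioned denoising loop over the template latent: noise the degraded template, then iteratively denoise it while conditioning on the language embedding. Everything below (the cosine schedule, the toy denoiser, the embedding size) is a hypothetical illustration of the general idea, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine_alphas(T):
    # Cosine noise schedule: cumulative signal fractions alpha_bar_t in (0, 1].
    t = np.arange(T + 1) / T
    f = np.cos((t + 0.008) / 1.008 * np.pi / 2) ** 2
    return f / f[0]

def denoise_step(z_t, text_emb, t, alphas):
    # Placeholder "denoiser": nudges the noisy template latent toward the
    # text embedding, standing in for a learned text-conditioned network.
    pred_clean = 0.5 * z_t + 0.5 * text_emb
    a_prev = alphas[t - 1]
    noise = rng.normal(size=z_t.shape)
    return np.sqrt(a_prev) * pred_clean + np.sqrt(1 - a_prev) * noise

def generative_fusion(template_latent, text_emb, T=50):
    alphas = cosine_alphas(T)
    # Forward process: fully noise the (low-semantic) template latent.
    z = (np.sqrt(alphas[T]) * template_latent
         + np.sqrt(1 - alphas[T]) * rng.normal(size=template_latent.shape))
    # Reverse process: iteratively denoise, conditioned on the text.
    for t in range(T, 0, -1):
        z = denoise_step(z, text_emb, t, alphas)
    return z

template = rng.normal(size=64)  # degraded template latent (hypothetical encoder output)
text = rng.normal(size=64)      # language embedding of the target description
fused = generative_fusion(template, text)
print(fused.shape)
```

In the real model, the placeholder denoiser would be a learned network and the fused latent would feed the Transformer tracker alongside the search-region features.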

📝 Abstract
Vision-language tracking has gained increasing attention in many scenarios. This task jointly processes visual and linguistic information to localize objects in videos. Despite its growing utility, the development of vision-language tracking methods remains at an early stage. Current vision-language trackers usually employ Transformer architectures for the interactive integration of template, search, and text features. However, persistent issues with low-semantic images, such as prevalent image blurriness and low resolution, can compromise model performance through degraded cross-modal understanding. Language assistance is commonly used to overcome the obstacles posed by low-semantic images; however, owing to the gap between current textual and visual features, direct concatenation and fusion of these features may have limited effectiveness. To address these challenges, we introduce a pioneering Generative Language-AssisteD tracking model, GLAD, which utilizes diffusion models for generative multi-modal fusion of the text description and template image, bolstering the compatibility between language and image and enhancing the semantic information of the template image. Our approach demonstrates notable improvements over existing fusion paradigms: blurry and semantically ambiguous template images can be restored to improve multi-modal features in the generative fusion paradigm. Experiments show that our method establishes a new state-of-the-art on multiple benchmarks and achieves an impressive inference speed. The code and models will be released at: https://github.com/Confetti-lxy/GLAD
Problem

Research questions and friction points this paper is trying to address.

vision-language tracking
low-semantic templates
cross-modal understanding
feature fusion
template image degradation
Innovation

Methods, ideas, or system contributions that make the work stand out.

diffusion models
generative fusion
vision-language tracking
low-semantic templates
multimodal compatibility
Xingyu Luo
State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, 210023, China
Yidong Cai
Nanjing University
Jie Liu
Nanjing University
Jie Tang
UW Madison
Computed Tomography
Gangshan Wu
State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, 210023, China
Limin Wang
Nanjing University
Computer Vision · Action Recognition · Video Understanding