QTrack: Query-Driven Reasoning for Multi-modal MOT

📅 2026-03-14
🤖 AI Summary
This work addresses the limitation that existing multi-object tracking methods cannot selectively track targets specified by natural language instructions. We propose a novel query-driven paradigm, formulating the task as a language-guided spatiotemporal reasoning problem: given a video sequence and a textual query, the model must localize and consistently track only those objects semantically matching the query while preserving temporal coherence and identity consistency. To this end, we introduce RMOT26, a new benchmark designed to prevent identity leakage, and present QTrack, an end-to-end vision-language model that integrates multimodal reasoning with a temporal-aware policy optimization mechanism. Experiments demonstrate that our approach significantly outperforms existing methods on RMOT26, validating the effectiveness of a language-guided, reasoning-centric tracking framework.

📝 Abstract
Multi-object tracking (MOT) has traditionally focused on estimating trajectories of all objects in a video, without selectively reasoning about user-specified targets under semantic instructions. In this work, we introduce a query-driven tracking paradigm that formulates tracking as a spatiotemporal reasoning problem conditioned on natural language queries. Given a reference frame, a video sequence, and a textual query, the goal is to localize and track only the target(s) specified in the query while maintaining temporal coherence and identity consistency. To support this setting, we construct RMOT26, a large-scale benchmark with grounded queries and sequence-level splits to prevent identity leakage and enable robust evaluation of generalization. We further present QTrack, an end-to-end vision-language model that integrates multimodal reasoning with tracking-oriented localization. Additionally, we introduce a Temporal Perception-Aware Policy Optimization strategy with structured rewards to encourage motion-aware reasoning. Extensive experiments demonstrate the effectiveness of our approach for reasoning-centric, language-guided tracking. Code and data are available at https://github.com/gaash-lab/QTrack
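The task setup in the abstract (reference frame, video sequence, and textual query in; identity-consistent tracks of the matching targets out) can be sketched as a minimal data interface. All names below are illustrative assumptions for exposition, not the paper's actual code or the QTrack API:

```python
from dataclasses import dataclass

@dataclass
class TrackedBox:
    """One localized detection belonging to a query-matched target."""
    frame_idx: int                      # position in the video sequence
    track_id: int                       # identity, kept consistent across frames
    box: tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels

@dataclass
class QueryTrackingTask:
    """Inputs of the query-driven tracking problem as described in the abstract."""
    reference_frame: object             # e.g. an image array
    video: list                         # ordered list of frames
    query: str                          # natural-language description of the target(s)

def identity_consistent(tracks: list[TrackedBox]) -> bool:
    """Check one facet of the task definition: each track_id must appear on
    strictly increasing frame indices, i.e. no duplicated identity per frame."""
    last_frame: dict[int, int] = {}
    for t in sorted(tracks, key=lambda t: t.frame_idx):
        if t.track_id in last_frame and t.frame_idx <= last_frame[t.track_id]:
            return False
        last_frame[t.track_id] = t.frame_idx
    return True
```

A model solving this task would map a `QueryTrackingTask` to a list of `TrackedBox` entries covering only query-matched objects; the `identity_consistent` check illustrates the identity-consistency requirement the benchmark evaluates.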
Problem

Research questions and friction points this paper is trying to address.

multi-object tracking
query-driven reasoning
vision-language model
temporal coherence
semantic instruction
Innovation

Methods, ideas, or system contributions that make the work stand out.

query-driven tracking
vision-language model
multimodal reasoning
temporal perception-aware optimization
RMOT26 benchmark
Tajamul Ashraf
IIT Delhi, MBZUAI
Computer Vision, Deep Learning
Tavaheed Tariq
Gaash Research Lab, National Institute of Technology Srinagar, India
Sonia Yadav
Gaash Research Lab, National Institute of Technology Srinagar, India
Abrar Ul Riyaz
Gaash Research Lab, National Institute of Technology Srinagar, India
Wasif Tak
Thapar Institute of Engineering and Technology, India
Moloud Abdar
Senior Data Scientist, The University of Queensland, Australia
Machine Learning, Deep Learning, Computer Vision, Vision-Language Models, Sentiment Analysis
Janibul Bashir
Gaash Research Lab, National Institute of Technology Srinagar, India