QTrack: Query-Driven Reasoning for Multi-modal MOT

📅 2026-03-14
🤖 AI Summary
This work addresses the limitation that existing multi-object tracking methods cannot selectively track targets specified by natural language instructions. We propose a novel query-driven paradigm, formulating the task as a language-guided spatiotemporal reasoning problem: given a video sequence and a textual query, the model must localize and consistently track only those objects semantically matching the query while preserving temporal coherence and identity consistency. To this end, we introduce RMOT26, a new benchmark designed to prevent identity leakage, and present QTrack, an end-to-end vision-language model that integrates multimodal reasoning with a temporal-aware policy optimization mechanism. Experiments demonstrate that our approach significantly outperforms existing methods on RMOT26, validating the effectiveness of a language-guided, reasoning-centric tracking framework.

📝 Abstract
Multi-object tracking (MOT) has traditionally focused on estimating trajectories of all objects in a video, without selectively reasoning about user-specified targets under semantic instructions. In this work, we introduce a query-driven tracking paradigm that formulates tracking as a spatiotemporal reasoning problem conditioned on natural language queries. Given a reference frame, a video sequence, and a textual query, the goal is to localize and track only the target(s) specified in the query while maintaining temporal coherence and identity consistency. To support this setting, we construct RMOT26, a large-scale benchmark with grounded queries and sequence-level splits to prevent identity leakage and enable robust evaluation of generalization. We further present QTrack, an end-to-end vision-language model that integrates multimodal reasoning with tracking-oriented localization. Additionally, we introduce a Temporal Perception-Aware Policy Optimization strategy with structured rewards to encourage motion-aware reasoning. Extensive experiments demonstrate the effectiveness of our approach for reasoning-centric, language-guided tracking. Code and data are available at https://github.com/gaash-lab/QTrack
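The task setup in the abstract (reference frame, video sequence, and textual query in; identity-consistent tracks of the matching targets out) can be sketched as a minimal data interface. All names below are illustrative assumptions for exposition, not the paper's actual code or the QTrack API:

```python
from dataclasses import dataclass

@dataclass
class TrackedBox:
    """One localized detection belonging to a query-matched target."""
    frame_idx: int                      # position in the video sequence
    track_id: int                       # identity, kept consistent across frames
    box: tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels

@dataclass
class QueryTrackingTask:
    """Inputs of the query-driven tracking problem as described in the abstract."""
    reference_frame: object             # e.g. an image array
    video: list                         # ordered list of frames
    query: str                          # natural-language description of the target(s)

def identity_consistent(tracks: list[TrackedBox]) -> bool:
    """Check one facet of the task definition: each track_id must appear on
    strictly increasing frame indices, i.e. no duplicated identity per frame."""
    last_frame: dict[int, int] = {}
    for t in sorted(tracks, key=lambda t: t.frame_idx):
        if t.track_id in last_frame and t.frame_idx <= last_frame[t.track_id]:
            return False
        last_frame[t.track_id] = t.frame_idx
    return True
```

A model solving this task would map a `QueryTrackingTask` to a list of `TrackedBox` entries covering only query-matched objects; the `identity_consistent` check illustrates the identity-consistency requirement the benchmark evaluates.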
Problem

Research questions and friction points this paper is trying to address.

multi-object tracking
query-driven reasoning
vision-language model
temporal coherence
semantic instruction
Innovation

Methods, ideas, or system contributions that make the work stand out.

query-driven tracking
vision-language model
multimodal reasoning
temporal perception-aware optimization
RMOT26 benchmark
Tajamul Ashraf
IIT Delhi, MBZUAI
Computer Vision, Deep Learning
Tavaheed Tariq
Gaash Research Lab, National Institute of Technology Srinagar, India
Sonia Yadav
Gaash Research Lab, National Institute of Technology Srinagar, India
Abrar Ul Riyaz
Gaash Research Lab, National Institute of Technology Srinagar, India
Wasif Tak
Thapar Institute of Engineering and Technology, India
Moloud Abdar
Senior Data Scientist, The University of Queensland, Australia
Machine Learning, Deep Learning, Computer Vision, Vision-Language Models, Sentiment Analysis
Janibul Bashir
Gaash Research Lab, National Institute of Technology Srinagar, India