🤖 AI Summary
Existing referential understanding methods struggle to model dynamic variations in object count and temporal state, while referring multi-object tracking (RMOT) faces bottlenecks including data scarcity, limited diversity, and poor generalization. To address these challenges, this paper proposes a language-guided self-bootstrapping paradigm. We introduce Refer-KITTI-V2, a large-scale, highly discriminative benchmark with 9,758 referring annotations spanning 617 distinct words, and design an end-to-end differentiable temporal tracking framework. The framework integrates large language model (LLM)-based prompt generation, automatic discovery of language-discriminative tokens, and cross-modal temporal feature alignment. By decoupling manual annotation from model design, our approach overcomes inherent coupling limitations and significantly advances RMOT performance, surpassing current state-of-the-art methods. Both the code and the Refer-KITTI-V2 dataset are publicly released to foster research in joint language-vision tracking.
📄 Abstract
Referring multi-object tracking (RMOT) aims to detect and track multiple objects following a human instruction expressed as a natural language expression. Existing RMOT benchmarks are usually built from manual annotations under static rules, which limits their diversity and narrows their scope of application. In this work, our key idea is to bootstrap the task of referring multi-object tracking by introducing as many discriminative language words as possible. Specifically, we first develop Refer-KITTI into a large-scale dataset, named Refer-KITTI-V2. It starts with 2,719 manual annotations that address class imbalance and introduce more keywords, making it closer to real-world scenarios than Refer-KITTI. These are further expanded to a total of 9,758 annotations by prompting large language models, yielding 617 distinct words and surpassing previous RMOT benchmarks. In addition, we bootstrap the end-to-end RMOT framework with a simple yet elegant temporal advancement strategy, which achieves better performance than previous approaches. The source code and dataset are available at https://github.com/zyn213/TempRMOT.
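The expression-bootstrapping idea, growing a small set of manual referring expressions into a much larger annotation pool via language-model prompting, can be illustrated with a minimal self-contained sketch. Here a fixed synonym table stands in for the LLM paraphraser; the table entries and function name are hypothetical, not taken from the paper's pipeline:

```python
from itertools import product

# Hypothetical synonym table standing in for LLM-generated paraphrases.
SYNONYMS = {
    "cars": ["cars", "vehicles"],
    "moving": ["moving", "in motion"],
    "left": ["left", "left-hand side"],
}

def expand_expression(expr: str) -> list[str]:
    """Generate paraphrases of one referring expression by substituting
    each known keyword with its synonyms (a toy stand-in for prompting
    an LLM to rewrite the expression)."""
    options = [SYNONYMS.get(word, [word]) for word in expr.split()]
    return [" ".join(combo) for combo in product(*options)]

# One manual annotation expands into 2 * 2 * 2 = 8 variants.
variants = expand_expression("cars moving on the left")
```

In the actual pipeline an LLM would also introduce genuinely new vocabulary rather than recombining a fixed table, which is how the dataset's word count grows to 617 distinct words.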