Bootstrapping Referring Multi-Object Tracking

📅 2024-06-07
🏛️ arXiv.org
📈 Citations: 14
✨ Influential: 7
🤖 AI Summary
Existing referential understanding methods struggle to model dynamic variations in object count and temporal state, while natural-language-guided multi-object tracking (RMOT) is bottlenecked by data scarcity, limited diversity, and poor generalization. To address these challenges, this paper proposes a language-guided self-bootstrapping paradigm. The authors introduce Refer-KITTI-V2, a large-scale, highly discriminative benchmark with 9,758 referring expressions built from 617 distinct words, and design an end-to-end temporal tracking framework. The pipeline combines large language model (LLM)-based prompt generation to expand annotations, the automatic introduction of discriminative language keywords, and cross-modal temporal feature alignment. By decoupling annotation scale-up from purely manual labeling, the approach significantly advances RMOT performance, surpassing prior state-of-the-art methods. Both the code and the Refer-KITTI-V2 dataset are publicly released to foster research in joint language-vision tracking.
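The summary describes expanding a small set of seed annotations into many referring expressions via LLM prompting. The toy sketch below illustrates the bootstrapping idea with simple synonym substitution standing in for the LLM call; the `SYNONYMS` table and `expand_expression` helper are illustrative inventions, not the paper's pipeline.

```python
# Toy sketch of expression bootstrapping: each seed annotation is
# expanded into paraphrased referring expressions. The paper uses an
# LLM for this step; synonym substitution stands in for it here.

SYNONYMS = {
    "cars": ["cars", "vehicles", "automobiles"],
    "moving": ["moving", "driving"],
    "left": ["left", "left-hand side"],
}

def expand_expression(expr: str) -> list[str]:
    """Return paraphrases of a seed expression via synonym substitution."""
    variants = {expr}
    for word, alts in SYNONYMS.items():
        new = set()
        for v in variants:
            if word in v:
                for alt in alts:
                    new.add(v.replace(word, alt))
        variants |= new
    return sorted(variants)

seeds = ["the moving cars on the left"]
expanded = [p for s in seeds for p in expand_expression(s)]
```

In the actual pipeline an LLM produces far richer rewrites than any fixed synonym table, which is how 2,719 seed annotations grow to 9,758 expressions covering 617 distinct words.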


๐Ÿ“ Abstract
Referring multi-object tracking (RMOT) aims at detecting and tracking multiple objects following a human instruction expressed in natural language. Existing RMOT benchmarks are usually built from manual annotations under static rules, which limits their diversity and constrains their scope of application. In this work, our key idea is to bootstrap the task of referring multi-object tracking by introducing as many discriminative language words as possible. Specifically, we first develop Refer-KITTI into a large-scale dataset, named Refer-KITTI-V2. It starts with 2,719 manual annotations that address class imbalance and introduce more keywords, bringing it closer to real-world scenarios than Refer-KITTI. These are further expanded to a total of 9,758 annotations by prompting large language models, which contribute 617 different words, surpassing previous RMOT benchmarks. In addition, the end-to-end RMOT framework is also bootstrapped by a simple yet elegant temporal advancement strategy, which achieves better performance than previous approaches. The source code and dataset are available at https://github.com/zyn213/TempRMOT.
Problem

Research questions and friction points this paper is trying to address.

Localizing objects described in free-form natural language expressions
Tracking multiple objects dynamically across spatial and temporal states
Addressing limitations in language expressiveness for visual understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semi-automatic labeling pipeline generates diverse language prompts
Transformer-based framework enables long-term spatiotemporal object interactions
Query-driven Temporal Enhancement Module refines object representations
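The last bullet describes a query-driven temporal enhancement step in which object queries from the current frame are refined against queries accumulated from past frames. The NumPy sketch below shows the core idea as a single-head, weight-free cross-attention with a residual update; it is a minimal illustration, not the authors' module, and the function name and shapes are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_enhance(current_q, memory_q):
    """Refine current-frame object queries by attending over a memory
    of queries from past frames (single-head attention, no learned
    projections, residual connection)."""
    # current_q: (N, d) queries for this frame; memory_q: (T*N, d) history
    d = current_q.shape[-1]
    attn = softmax(current_q @ memory_q.T / np.sqrt(d), axis=-1)
    return current_q + attn @ memory_q  # residual update

rng = np.random.default_rng(0)
cur = rng.normal(size=(5, 32))    # 5 object queries, 32-dim
mem = rng.normal(size=(20, 32))   # queries pooled from 4 past frames
refined = temporal_enhance(cur, mem)
```

A trained version would add learned query/key/value projections and feed-forward layers, but the long-term spatiotemporal interaction happens in exactly this attention over the historical query memory.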