COVTrack++: Learning Open-Vocabulary Multi-Object Tracking from Continuous Videos via a Synergistic Paradigm

📅 2026-03-25

🤖 AI Summary

This work addresses two obstacles to open-vocabulary multi-object tracking (OVMOT), which extends MOT beyond a fixed category vocabulary: the absence of continuously annotated training video and the difficulty of handling detection and association jointly. To overcome these challenges, we introduce C-TAO, the first continuously annotated training set for OVMOT, with 26x the annotation density of the original TAO, alongside COVTrack++, a framework in which detection and association reinforce each other through Multi-Cue Adaptive Fusion (MCF), Multi-Granularity Hierarchical Aggregation (MGA), and Temporal Confidence Propagation (TCP). Our approach achieves 35.4% and 30.5% novel TETA on the TAO validation and test sets, respectively, improving novel AssocA by 4.8% and novel LocA by 5.8% over previous methods. It also shows strong zero-shot generalization on BDD100K.

📝 Abstract

Multi-Object Tracking (MOT) has traditionally focused on a few specific categories, restricting its applicability to real-world scenarios involving diverse objects. Open-Vocabulary Multi-Object Tracking (OVMOT) addresses this by enabling tracking of arbitrary categories, including novel objects unseen during training. However, current progress is constrained by two challenges: the lack of continuously annotated video data for training, and the lack of a customized OVMOT framework to synergistically handle detection and association. We address the data bottleneck by constructing C-TAO, the first continuously annotated training set for OVMOT, which increases annotation density by 26x over the original TAO and captures smooth motion dynamics and intermediate object states. For the framework bottleneck, we propose COVTrack++, a synergistic framework that achieves a bidirectional reciprocal mechanism between detection and association through three modules: (1) Multi-Cue Adaptive Fusion (MCF) dynamically balances appearance, motion, and semantic cues for association feature learning; (2) Multi-Granularity Hierarchical Aggregation (MGA) exploits hierarchical spatial relationships in dense detections, where visible child nodes (e.g., object parts) assist occluded parent objects (e.g., whole body) for association feature enhancement; (3) Temporal Confidence Propagation (TCP) recovers flickering detections through high-confidence tracked objects boosting low-confidence candidates across frames, stabilizing trajectories. Extensive experiments on TAO demonstrate state-of-the-art performance, with novel TETA reaching 35.4% and 30.5% on validation and test sets, improving novel AssocA by 4.8% and novel LocA by 5.8% over previous methods, and show strong zero-shot generalization on BDD100K. The code and dataset will be publicly available.
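To make the idea behind Temporal Confidence Propagation more concrete, the following is a minimal sketch of how a high-confidence tracked object might boost a low-confidence candidate in the next frame. It is not the authors' implementation: the abstract does not specify the mechanism, so the IoU-based matching, the `Detection` type, the function `propagate_confidence`, and the parameters `high_thresh`, `iou_thresh`, and `boost` are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2); assumed box format


@dataclass
class Detection:
    box: Box
    score: float  # detector confidence in [0, 1]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def propagate_confidence(
    tracks: List[Detection],      # objects tracked in the previous frame
    candidates: List[Detection],  # raw detections in the current frame
    high_thresh: float = 0.6,     # hypothetical: tracks above this may boost
    iou_thresh: float = 0.5,      # hypothetical: minimum overlap to count as support
    boost: float = 0.5,           # hypothetical: strength of the boost
) -> List[Detection]:
    """Raise candidate scores that are supported by confident tracks.

    Assumption: support is measured by spatial overlap only. The paper
    may use learned features or a different matching rule.
    """
    boosted = []
    for cand in candidates:
        support = 0.0
        for trk in tracks:
            if trk.score < high_thresh:
                continue
            overlap = iou(trk.box, cand.box)
            if overlap >= iou_thresh:
                support = max(support, overlap * trk.score)
        # Move the score toward 1 in proportion to the support, never past 1.
        new_score = cand.score + boost * support * (1.0 - cand.score)
        boosted.append(Detection(cand.box, new_score))
    return boosted


if __name__ == "__main__":
    tracks = [Detection((0, 0, 10, 10), 0.9)]
    candidates = [
        Detection((0, 0, 10, 10), 0.2),    # overlaps the confident track
        Detection((50, 50, 60, 60), 0.2),  # no supporting track
    ]
    for det in propagate_confidence(tracks, candidates):
        print(det.box, round(det.score, 3))
```

In this sketch a flickering detection that overlaps a confidently tracked object is lifted above a typical detection threshold, while an unsupported candidate keeps its original score, which is the stabilizing behavior the abstract attributes to TCP.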

Problem

Research questions and friction points this paper is trying to address.

Open-Vocabulary Multi-Object Tracking

continuously annotated video data

detection and association synergy

multi-object tracking

novel object tracking

Innovation

Methods, ideas, or system contributions that make the work stand out.

Open-Vocabulary Multi-Object Tracking

Continuously Annotated Dataset

Synergistic Detection-Association Framework