COVTrack++: Learning Open-Vocabulary Multi-Object Tracking from Continuous Videos via a Synergistic Paradigm

📅 2026-03-25
🤖 AI Summary
This work addresses limitations of existing multi-object tracking (MOT): traditional trackers are restricted to fixed category vocabularies, and open-vocabulary MOT (OVMOT) lacks both continuously annotated training data and a framework that handles detection and association jointly. To overcome these challenges, we introduce C-TAO, the first continuously annotated training set for OVMOT, alongside COVTrack++, a framework in which detection and association reinforce each other through three modules: Multi-Cue Adaptive Fusion (MCF), Multi-Granularity Hierarchical Aggregation (MGA), and Temporal Confidence Propagation (TCP). Our approach achieves 35.4% and 30.5% novel TETA on the TAO validation and test sets, respectively, improving novel AssocA by 4.8% and novel LocA by 5.8% over previous methods. It also shows strong zero-shot generalization on BDD100K.
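To make the idea of adaptively balancing cues concrete, here is a minimal sketch of one way appearance, motion, and semantic similarities could be fused into a single association score. It is not the paper's MCF module: the function names, the softmax gating, and the `gate_logits` input (standing in for weights a network would predict) are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_cues(appearance, motion, semantic, gate_logits):
    """Combine three similarity scores with adaptive weights.

    Hypothetical helper: gate_logits stands in for per-pair gating values
    that a learned module would predict; here they are passed in by hand.
    """
    weights = softmax(gate_logits)
    cues = (appearance, motion, semantic)
    return sum(w * s for w, s in zip(weights, cues))


# Equal logits reduce to a plain average of the three cues.
balanced = fuse_cues(0.9, 0.3, 0.6, [0.0, 0.0, 0.0])

# A large appearance logit makes the fused score follow the appearance cue.
appearance_led = fuse_cues(0.9, 0.3, 0.6, [10.0, 0.0, 0.0])
```

The point of the gating is that the relative weight of each cue can change per object pair, for example leaning on motion when appearance is unreliable under occlusion.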

📝 Abstract
Multi-Object Tracking (MOT) has traditionally focused on a few specific categories, restricting its applicability to real-world scenarios involving diverse objects. Open-Vocabulary Multi-Object Tracking (OVMOT) addresses this by enabling tracking of arbitrary categories, including novel objects unseen during training. However, current progress is constrained by two challenges: the lack of continuously annotated video data for training, and the lack of a customized OVMOT framework to synergistically handle detection and association. We address the data bottleneck by constructing C-TAO, the first continuously annotated training set for OVMOT, which increases annotation density by 26x over the original TAO and captures smooth motion dynamics and intermediate object states. For the framework bottleneck, we propose COVTrack++, a synergistic framework that achieves a bidirectional reciprocal mechanism between detection and association through three modules: (1) Multi-Cue Adaptive Fusion (MCF) dynamically balances appearance, motion, and semantic cues for association feature learning; (2) Multi-Granularity Hierarchical Aggregation (MGA) exploits hierarchical spatial relationships in dense detections, where visible child nodes (e.g., object parts) assist occluded parent objects (e.g., whole body) for association feature enhancement; (3) Temporal Confidence Propagation (TCP) recovers flickering detections through high-confidence tracked objects boosting low-confidence candidates across frames, stabilizing trajectories. Extensive experiments on TAO demonstrate state-of-the-art performance, with novel TETA reaching 35.4% and 30.5% on validation and test sets, improving novel AssocA by 4.8% and novel LocA by 5.8% over previous methods, and show strong zero-shot generalization on BDD100K. The code and dataset will be publicly available.
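The Temporal Confidence Propagation idea, in which high-confidence tracked objects boost low-confidence candidates in the next frame, can be sketched roughly as follows. This is an illustrative approximation rather than the authors' implementation: the IoU-based matching, the thresholds, the `alpha` boost factor, and all function names are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def propagate_confidence(tracks, candidates,
                         track_thresh=0.7, iou_thresh=0.5, alpha=0.5):
    """Boost candidate scores that overlap a confident track.

    tracks and candidates are lists of (box, score) pairs. All thresholds
    and the additive boost rule are hypothetical choices for this sketch.
    """
    boosted = []
    for box, score in candidates:
        support = 0.0
        for track_box, track_score in tracks:
            if track_score < track_thresh:
                continue  # only confident tracks may lend support
            overlap = iou(box, track_box)
            if overlap >= iou_thresh:
                support = max(support, track_score * overlap)
        boosted.append((box, min(1.0, score + alpha * support)))
    return boosted


tracks = [((0, 0, 10, 10), 0.9)]
candidates = [((0, 0, 10, 10), 0.2), ((50, 50, 60, 60), 0.2)]
result = propagate_confidence(tracks, candidates)
```

In this example the first candidate coincides with a confident track and its score rises, while the second candidate has no supporting track and keeps its original score. A detection that flickers below the detector threshold can be retained this way when a reliable track predicts an object at that location.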
Problem

Research questions and friction points this paper is trying to address.

Open-Vocabulary Multi-Object Tracking
continuously annotated video data
detection and association synergy
multi-object tracking
novel object tracking
Innovation

Methods, ideas, or system contributions that make the work stand out.

Open-Vocabulary Multi-Object Tracking
Continuously Annotated Dataset
Synergistic Detection-Association Framework
Multi-Cue Adaptive Fusion
Temporal Confidence Propagation