🤖 AI Summary
To address zero-shot, language-guided multi-object localization and tracking in complex real-world videos, this paper proposes a training-free, two-stage cross-modal retrieval framework. In the first stage, a frozen multimodal large language model, LLaVA-Video, performs fine-grained vision-language alignment and parses spatiotemporal cues from the query. In the second stage, the query-guided localization outputs are fed into the state-of-the-art tracker FastTracker to produce accurate, robust multi-object trajectories. Critically, the method requires no fine-tuning or task-specific training. Evaluated on the MOT25-StAG benchmark, it achieves an m-HIoU of 20.68 and a HOTA of 10.73, taking second place in the associated challenge and marking, per the authors, the first demonstration of end-to-end, large language model-driven zero-shot spatiotemporal localization and tracking of multiple objects.
📝 Abstract
In this report, we present our solution to the MOT25-Spatiotemporal Action Grounding (MOT25-StAG) Challenge. The goal of the challenge is to accurately localize and track multiple objects that match a specific, free-form language query, given video of complex real-world scenes as input. We model the underlying task as a video retrieval problem and present a two-stage, zero-shot approach that combines the strengths of the SOTA tracking model FastTracker and the multimodal large language model LLaVA-Video. On the MOT25-StAG test set, our method achieves m-HIoU and HOTA scores of 20.68 and 10.73, respectively, earning second place in the challenge.
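The two-stage pipeline described above can be sketched in skeleton form. This is a minimal illustration, not the authors' implementation: `ground_query` is a hypothetical stand-in for the frozen LLaVA-Video grounding stage, and the greedy IoU association is a simplified placeholder for what FastTracker actually does.

```python
from dataclasses import dataclass, field

@dataclass
class Track:
    track_id: int
    boxes: dict = field(default_factory=dict)  # frame index -> (x1, y1, x2, y2)

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def ground_query(per_frame_candidates, query):
    """Stage 1 stand-in: in the paper a frozen video-language model
    (LLaVA-Video) returns, per frame, the boxes matching the free-form
    language query. Here we simply pass candidates through."""
    return per_frame_candidates

def associate(per_frame_boxes, iou_thresh=0.5):
    """Stage 2 stand-in: greedy IoU linking of grounded boxes into
    trajectories (the actual system delegates this to FastTracker)."""
    tracks, next_id = [], 0
    for t, boxes in enumerate(per_frame_boxes):
        for box in boxes:
            best, best_score = None, iou_thresh
            for tr in tracks:
                last_box = tr.boxes[max(tr.boxes)]  # most recent box
                score = iou(last_box, box)
                if score > best_score:
                    best, best_score = tr, score
            if best is None:  # no overlap above threshold: start a new track
                best = Track(next_id)
                next_id += 1
                tracks.append(best)
            best.boxes[t] = box
    return tracks
```

For example, three slightly shifted boxes across three frames are linked into a single trajectory by `associate(ground_query(frames, "the person in red"))`. The key design point the sketch mirrors is that grounding and tracking are decoupled, so neither component needs task-specific training.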