🤖 AI Summary
To address zero-shot, language-guided multi-object localization and tracking in complex real-world videos, this paper proposes a training-free, two-stage cross-modal retrieval framework. In the first stage, a frozen multimodal large language model, LLaVA-Video, performs fine-grained vision-language alignment and parses spatiotemporal cues from the query. In the second stage, the query-guided localization outputs are fed into the state-of-the-art tracker FastTracker to produce accurate, robust multi-object trajectories. Critically, the method requires no fine-tuning or task-specific training. Evaluated on the MOT25-StAG benchmark, it achieves an m-HIoU of 20.68 and a HOTA of 10.73, taking second place in the associated challenge and marking, per the authors, the first demonstration of end-to-end, large language model-driven zero-shot spatiotemporal localization and tracking of multiple objects.
📝 Abstract
In this report, we present our solution to the MOT25-Spatiotemporal Action Grounding (MOT25-StAG) Challenge. The goal of the challenge is to accurately localize and track multiple objects that match a specific, free-form language query, given video of complex real-world scenes as input. We model the underlying task as a video retrieval problem and present a two-stage, zero-shot approach that combines the strengths of the SOTA tracking model FastTracker and the multimodal large language model LLaVA-Video. On the MOT25-StAG test set, our method achieves m-HIoU and HOTA scores of 20.68 and 10.73, respectively, earning second place in the challenge.
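The two-stage pipeline described above can be sketched in skeleton form. This is a minimal illustration, not the authors' implementation: `ground_query` is a hypothetical stand-in for the frozen LLaVA-Video grounding stage, and the greedy IoU association is a simplified placeholder for what FastTracker actually does.

```python
from dataclasses import dataclass, field

@dataclass
class Track:
    track_id: int
    boxes: dict = field(default_factory=dict)  # frame index -> (x1, y1, x2, y2)

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def ground_query(per_frame_candidates, query):
    """Stage 1 stand-in: in the paper a frozen video-language model
    (LLaVA-Video) returns, per frame, the boxes matching the free-form
    language query. Here we simply pass candidates through."""
    return per_frame_candidates

def associate(per_frame_boxes, iou_thresh=0.5):
    """Stage 2 stand-in: greedy IoU linking of grounded boxes into
    trajectories (the actual system delegates this to FastTracker)."""
    tracks, next_id = [], 0
    for t, boxes in enumerate(per_frame_boxes):
        for box in boxes:
            best, best_score = None, iou_thresh
            for tr in tracks:
                last_box = tr.boxes[max(tr.boxes)]  # most recent box
                score = iou(last_box, box)
                if score > best_score:
                    best, best_score = tr, score
            if best is None:  # no overlap above threshold: start a new track
                best = Track(next_id)
                next_id += 1
                tracks.append(best)
            best.boxes[t] = box
    return tracks
```

For example, three slightly shifted boxes across three frames are linked into a single trajectory by `associate(ground_query(frames, "the person in red"))`. The key design point the sketch mirrors is that grounding and tracking are decoupled, so neither component needs task-specific training.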