Training-Free Semantic Multi-Object Tracking with Vision-Language Models

📅 2026-04-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limitations of existing semantic multi-object tracking (SMOT) approaches, which rely on end-to-end training, require extensive annotated data, and adapt poorly to new foundation models or interaction types. It proposes TF-SMOT, a training-free SMOT framework that composes off-the-shelf pretrained components (the detector D-FINE, the promptable segmentation tracker SAM2, and the video-language model InternVideo2.5) and converts object trajectories into human-interpretable semantic descriptions. Extracted interaction predicates are aligned to WordNet synsets through gloss-based semantic retrieval followed by large language model (LLM) disambiguation. On the BenSMOT benchmark, the method achieves state-of-the-art tracking performance within the SMOT setting and improves the quality of video summaries and instance-level captions over prior work. Interaction recognition remains challenging under strict exact-match evaluation on the fine-grained, long-tailed label space.

📝 Abstract

Semantic Multi-Object Tracking (SMOT) extends multi-object tracking with semantic outputs such as video summaries, instance-level captions, and interaction labels, aiming to move from trajectories to human-interpretable descriptions of dynamic scenes. Existing SMOT systems are trained end-to-end, which couples progress to expensive supervision and limits the ability to adapt rapidly to new foundation models and new interactions. We propose TF-SMOT, a training-free SMOT pipeline that composes pretrained components for detection, mask-based tracking, and video-language generation. TF-SMOT combines D-FINE and the promptable SAM2 segmentation tracker to produce temporally consistent tracklets, uses contour grounding to generate video summaries and instance captions with InternVideo2.5, and aligns extracted interaction predicates to BenSMOT WordNet synsets via gloss-based semantic retrieval with LLM disambiguation. On BenSMOT, TF-SMOT achieves state-of-the-art tracking performance within the SMOT setting and improves summary and caption quality compared to prior art. Interaction recognition, however, remains challenging under strict exact-match evaluation on the fine-grained and long-tailed WordNet label space; our analysis and ablations indicate that semantic overlap and label granularity substantially affect measured performance.
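The gloss-based retrieval step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it swaps in a plain bag-of-words cosine similarity where the paper presumably uses a learned semantic retriever, and the label identifiers, glosses, stopword list, and the `retrieve` helper are all hypothetical placeholders rather than actual BenSMOT or WordNet entries. The sketch only shows the shape of the idea: score each candidate gloss against a caption and return a shortlist, which a later LLM disambiguation step (not shown) would narrow to one label.

```python
import math
import re
from collections import Counter

# Hypothetical label space: identifiers and glosses are illustrative
# placeholders, not the actual BenSMOT WordNet synsets.
GLOSSES = {
    "hold.v.illustrative": "have or keep an object in the hands",
    "ride.v.illustrative": "sit on and control a bicycle or an animal while it moves",
    "talk.v.illustrative": "exchange spoken words with another person",
}

# Small assumed stopword list so function words do not dominate the score.
STOPWORDS = {"a", "an", "the", "in", "on", "or", "and", "of", "it", "his", "with", "while"}


def tokenize(text):
    """Lowercase, split into alphabetic tokens, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(caption, glosses, top_k=2):
    """Return the top_k (label, score) pairs whose glosses best match the caption."""
    query = Counter(tokenize(caption))
    scored = [(cosine(query, Counter(tokenize(g))), label) for label, g in glosses.items()]
    scored.sort(reverse=True)
    return [(label, score) for score, label in scored[:top_k]]


# Shortlist of candidate labels for one instance caption.
candidates = retrieve("a man keeps a cup in his hands", GLOSSES)
```

Returning a shortlist instead of a single best match leaves room for the disambiguation stage, which matters when several labels have overlapping glosses, as the abstract notes for the fine-grained label space.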

Problem

Research questions and friction points this paper is trying to address.

Semantic Multi-Object Tracking

training-free

vision-language models

interaction recognition

foundation models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-Free

Semantic Multi-Object Tracking

Vision-Language Models