STORM: End-to-End Referring Multi-Object Tracking in Videos

📅 2026-04-12

🤖 AI Summary
This work addresses referring multi-object tracking (RMOT) in videos, a task hindered by fragmented model architectures, scarce annotated data, ambiguous labels, and poor domain generalization. To overcome these limitations, the authors propose STORM, an end-to-end multimodal large language model (MLLM) that jointly performs object grounding and tracking within a unified framework, eliminating reliance on external detectors. STORM reasons jointly over visual appearance, motion, and language. A task-composition learning (TCL) strategy decomposes RMOT into image grounding and object tracking sub-tasks, so the model can learn from data-rich sub-tasks and use training data more efficiently. The authors also construct STORM-Bench, an RMOT dataset with accurate trajectories and unambiguous referring expressions. Experiments show that STORM achieves state-of-the-art performance on image grounding, single-object tracking, and RMOT benchmarks, with strong generalization and robust spatial-temporal grounding in complex scenarios.

📝 Abstract
Referring multi-object tracking (RMOT) is the task of associating all objects in a video that semantically match given textual queries or referring expressions. Existing RMOT approaches split object grounding and tracking into separate modules and exhibit limited performance due to the scarcity of training videos, ambiguous annotations, and restricted domains. In this work, we introduce STORM, an end-to-end multimodal large language model (MLLM) that jointly performs grounding and tracking within a unified framework, eliminating external detectors and enabling coherent reasoning over appearance, motion, and language. To improve data efficiency, we propose a task-composition learning (TCL) strategy that decomposes RMOT into image grounding and object tracking, allowing STORM to leverage data-rich sub-tasks and learn structured spatial-temporal reasoning. We further construct STORM-Bench, a new RMOT dataset with accurate trajectories and diverse, unambiguous referring expressions generated through a bottom-up annotation pipeline. Extensive experiments show that STORM achieves state-of-the-art performance on image grounding, single-object tracking, and RMOT benchmarks, demonstrating strong generalization and robust spatial-temporal grounding in complex real-world scenarios. STORM-Bench is released at https://github.com/amazon-science/storm-referring-multi-object-grounding.
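
To make the task concrete, the sketch below illustrates the modular decomposition that the abstract contrasts STORM against: a grounding step selects the boxes matching the query in each frame, and a separate association step links them into tracks. This is not STORM's method (STORM performs both jointly inside one MLLM). The function names `iou` and `link_tracks`, the greedy IoU matching, and the threshold value are all assumptions chosen for illustration, and the per-frame boxes are assumed to come from a hypothetical grounding model.

```python
# Illustrative only: a toy "ground, then associate" pipeline, NOT the STORM model.
# Assumption: frames[t] already holds the boxes that some hypothetical grounding
# step judged to match the referring expression in frame t.

Box = tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def link_tracks(frames: list[list[Box]], iou_threshold: float = 0.3) -> list[dict[int, Box]]:
    """Greedily link per-frame boxes into tracks by IoU with each track's last box.

    Returns one dict per frame, mapping track id -> box.
    """
    next_id = 0
    last_box: dict[int, Box] = {}  # most recent box seen for each track
    result: list[dict[int, Box]] = []
    for boxes in frames:
        assigned: dict[int, Box] = {}
        free_tracks = set(last_box)  # tracks not yet matched in this frame
        for box in boxes:
            best_id, best_score = None, iou_threshold
            for track_id in free_tracks:
                score = iou(last_box[track_id], box)
                if score >= best_score:
                    best_id, best_score = track_id, score
            if best_id is None:
                # No existing track overlaps enough: start a new one.
                best_id = next_id
                next_id += 1
            else:
                free_tracks.discard(best_id)
            assigned[best_id] = box
        last_box.update(assigned)
        result.append(assigned)
    return result


if __name__ == "__main__":
    frames = [
        [(0, 0, 10, 10), (50, 50, 60, 60)],
        [(1, 0, 11, 10), (51, 50, 61, 60)],
        [(2, 0, 12, 10)],  # the second object no longer matches the query
    ]
    for t, tracks in enumerate(link_tracks(frames)):
        print(t, tracks)
```

A pipeline like this depends entirely on the quality of the per-frame grounding and uses no language or motion cues during association, which is one way to read the abstract's motivation for handling grounding and tracking in a single model.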
Problem

Research questions and friction points this paper is trying to address.

referring multi-object tracking
video grounding
text-to-video alignment
ambiguous annotations
data scarcity
Innovation

Methods, ideas, or system contributions that make the work stand out.

end-to-end MLLM
referring multi-object tracking
task-composition learning
spatial-temporal grounding
unified grounding and tracking