ORMOT: A Dataset and Framework for Omnidirectional Referring Multi-Object Tracking

📅 2026-03-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing vision-language multi-object tracking methods, which are constrained by the narrow field of view of conventional cameras and are prone to target loss and contextual discontinuity. To overcome this, we introduce Omnidirectional Referring Multi-Object Tracking (ORMOT)—the first formulation of referring multi-object tracking in 360° panoramic scenes—and present ORSet, the first large-scale multimodal dataset for this task, comprising 27 panoramic scenes, 848 natural language expressions, and 3,401 annotated objects. We further propose ORTrack, a novel framework built upon large vision-language models that integrates panoramic image understanding, cross-modal alignment, and temporal tracking mechanisms. Experimental results on the ORSet benchmark demonstrate that ORTrack significantly improves the robustness and accuracy of language-guided multi-object tracking in panoramic environments.

📝 Abstract
Multi-Object Tracking (MOT) is a fundamental task in computer vision, aiming to track targets across video frames. Existing MOT methods perform well in general visual scenes, but face significant challenges and limitations when extended to visual-language settings. To bridge this gap, the task of Referring Multi-Object Tracking (RMOT) has recently been proposed, which aims to track objects that correspond to language descriptions. However, current RMOT methods are primarily developed on datasets captured by conventional cameras, which suffer from limited field of view. This constraint often causes targets to move out of the frame, leading to fragmented tracking and loss of contextual information. In this work, we propose a novel task, called Omnidirectional Referring Multi-Object Tracking (ORMOT), which extends RMOT to omnidirectional imagery, aiming to overcome the field-of-view (FoV) limitation of conventional datasets and improve the model's ability to understand long-horizon language descriptions. To advance the ORMOT task, we construct ORSet, an Omnidirectional Referring Multi-Object Tracking dataset, which contains 27 diverse omnidirectional scenes, 848 language descriptions, and 3,401 annotated objects, providing rich visual, temporal, and language information. Furthermore, we propose ORTrack, a Large Vision-Language Model (LVLM)-driven framework tailored for Omnidirectional Referring Multi-Object Tracking. Extensive experiments on the ORSet dataset demonstrate the effectiveness of our ORTrack framework. The dataset and code will be open-sourced at https://github.com/chen-si-jia/ORMOT.
Problem

Research questions and friction points this paper is trying to address.

Omnidirectional
Referring Multi-Object Tracking
Field of View
Visual-Language
Multi-Object Tracking
Innovation

Methods, ideas, or system contributions that make the work stand out.

Omnidirectional Vision
Referring Multi-Object Tracking
Large Vision-Language Model
360-degree Video
Visual-Language Understanding
Sijia Chen
State Key Laboratory of Multispectral Information Intelligent Processing Technology, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan, 430074, Hubei, China
Zihan Zhou
South China University of Technology
Computer Vision, Image Processing, Deep Learning
Yanqiu Yu
State Key Laboratory of Multispectral Information Intelligent Processing Technology, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan, 430074, Hubei, China
En Yu
State Key Laboratory of Multispectral Information Intelligent Processing Technology, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan, 430074, Hubei, China
Wenbing Tao
Professor, School of Automation, Huazhong University of Science and Technology
Image Processing, Computer Vision, Pattern Recognition