RT-RMOT: A Dataset and Framework for RGB-Thermal Referring Multi-Object Tracking

Published: 2026-02-25
AI Summary
This work addresses the performance degradation of referring multi-object tracking (RMOT) in low-visibility scenarios such as nighttime or smoky conditions. It introduces RT-RMOT, a new RMOT task that fuses the RGB and thermal imaging modalities, and presents RefRT, the first RGB-thermal referring tracking dataset. On this foundation, the authors propose RTrack, a framework built on a multimodal large language model and fine-tuned with reinforcement learning using Group Sequence Policy Optimization (GSPO), stabilized by Clipped Advantage Scaling (CAS), and guided by a Structured Output Reward and a Comprehensive Detection Reward. Experiments on RefRT show that RTrack is effective under challenging illumination conditions, supporting the use of multimodal fusion and reinforcement learning optimization for all-day referring tracking.

πŸ“ Abstract
Referring Multi-Object Tracking (RMOT) has attracted increasing attention due to its human-friendly interactive characteristics, yet it exhibits limitations in low-visibility conditions, such as nighttime, smoke, and other challenging scenarios. To overcome this limitation, we propose a new RGB-Thermal RMOT task, named RT-RMOT, which aims to fuse RGB appearance features with the illumination robustness of the thermal modality to enable all-day referring multi-object tracking. To promote research on RT-RMOT, we construct the first Referring Multi-Object Tracking dataset for the RGB-Thermal modality, named RefRT. It contains 388 language descriptions, 1,250 tracked targets, and 166,147 Language-RGB-Thermal (L-RGB-T) triplets. Furthermore, we propose RTrack, a framework built upon a multimodal large language model (MLLM) that integrates RGB, thermal, and textual features. Since the initial framework still leaves room for improvement, we introduce a Group Sequence Policy Optimization (GSPO) strategy to further exploit the model's potential. To alleviate training instability during RL fine-tuning, we introduce a Clipped Advantage Scaling (CAS) strategy to suppress gradient explosion. In addition, we design a Structured Output Reward and a Comprehensive Detection Reward to balance exploration and exploitation, thereby improving the completeness and accuracy of target perception. Extensive experiments on the RefRT dataset demonstrate the effectiveness of the proposed RTrack framework.
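The abstract names GSPO and CAS without giving formulas. As a rough illustration only, the sketch below shows the general shape of a GSPO-style update: advantages normalized within a group of sampled responses, a length-normalized sequence-level importance ratio, and a clipped surrogate objective. The `clip` bound on advantages is an assumed stand-in for CAS, not the paper's actual rule, and all function names, `ratio_eps`, and `adv_clip` are hypothetical values chosen for the example.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, clip=None, eps=1e-6):
    """Group-normalized advantages: (r - mean) / std over one prompt's samples.

    `clip` is an ASSUMED stand-in for CAS: it bounds each advantage to
    [-clip, clip]. The abstract does not specify the real CAS rule.
    """
    mu, sigma = mean(rewards), pstdev(rewards)
    adv = [(r - mu) / (sigma + eps) for r in rewards]
    if clip is not None:
        adv = [max(-clip, min(clip, a)) for a in adv]
    return adv


def sequence_ratio(logp_new, logp_old):
    """Sequence-level importance ratio in the style of GSPO.

    Inputs are per-token log-probabilities of one response under the current
    and old policies. Returns exp(mean(logp_new - logp_old)), i.e. the
    length-normalized likelihood ratio.
    """
    diffs = [n - o for n, o in zip(logp_new, logp_old)]
    return math.exp(sum(diffs) / len(diffs))


def gspo_objective(logps_new, logps_old, rewards, ratio_eps=0.2, adv_clip=2.0):
    """Clipped surrogate objective averaged over a group of responses.

    ratio_eps and adv_clip are illustrative values, not taken from the paper.
    """
    advs = group_advantages(rewards, clip=adv_clip)
    total = 0.0
    for lp_new, lp_old, a in zip(logps_new, logps_old, advs):
        s = sequence_ratio(lp_new, lp_old)
        s_clipped = max(1.0 - ratio_eps, min(1.0 + ratio_eps, s))
        # Pessimistic (min) choice between unclipped and clipped terms.
        total += min(s * a, s_clipped * a)
    return total / len(advs)
```

In a real training loop the log-probabilities would come from the MLLM and the rewards from the paper's Structured Output Reward and Comprehensive Detection Reward; here they are plain Python lists so the arithmetic can be checked by hand.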
Problem

Research questions and friction points this paper is trying to address.

Referring Multi-Object Tracking
RGB-Thermal
low-visibility
all-day tracking
multimodal perception
Innovation

Methods, ideas, or system contributions that make the work stand out.

RGB-Thermal Fusion
Referring Multi-Object Tracking
Multimodal Large Language Model
Reinforcement Learning Optimization
Group Sequence Policy Optimization
Authors
Yanqiu Yu
Huazhong University of Science and Technology, Wuhan, Hubei, China
Zhifan Jin
South-Central Minzu University, Wuhan, Hubei, China
Sijia Chen
Huazhong University of Science and Technology, Wuhan, Hubei, China
Tongfei Chu
Huazhong University of Science and Technology, Wuhan, Hubei, China
En Yu
Huazhong University of Science and Technology, Wuhan, Hubei, China
Liman Liu
South-Central Minzu University, Wuhan, Hubei, China
Wenbing Tao
Professor of School of Automation, Huazhong University of Science and Technology
image processing, computer vision, pattern recognition