AI Summary
This work addresses the performance degradation of referring multi-object tracking (RMOT) in low-visibility scenarios such as nighttime or smoky conditions. To this end, it introduces RT-RMOT, a new RMOT task that fuses RGB and thermal imaging modalities, and presents RefRT, the first RGB-thermal referring tracking dataset. Building on this foundation, the authors propose RTrack, a framework based on a multimodal large language model that is fine-tuned via reinforcement learning using Group Sequence Policy Optimization (GSPO), stabilized with a Clipped Advantage Scaling (CAS) strategy, and guided by structured reward functions. Experiments on RefRT indicate that RTrack is effective under challenging illumination conditions, supporting the use of multimodal fusion and reinforcement learning optimization for all-day referring tracking.
Abstract
Referring Multi-Object Tracking has attracted increasing attention due to its human-friendly interactive characteristics, yet it exhibits limitations in low-visibility conditions, such as nighttime, smoke, and other challenging scenarios. To overcome this limitation, we propose a new RGB-Thermal RMOT task, named RT-RMOT, which aims to fuse RGB appearance features with the illumination robustness of the thermal modality to enable all-day referring multi-object tracking. To promote research on RT-RMOT, we construct the first Referring Multi-Object Tracking dataset under RGB-Thermal modality, named RefRT. It contains 388 language descriptions, 1,250 tracked targets, and 166,147 Language-RGB-Thermal (L-RGB-T) triplets. Furthermore, we propose RTrack, a framework built upon a multimodal large language model (MLLM) that integrates RGB, thermal, and textual features. Since the initial framework still leaves room for improvement, we introduce a Group Sequence Policy Optimization (GSPO) strategy to further exploit the model's potential. To alleviate training instability during RL fine-tuning, we introduce a Clipped Advantage Scaling (CAS) strategy to suppress gradient explosion. In addition, we design Structured Output Reward and Comprehensive Detection Reward to balance exploration and exploitation, thereby improving the completeness and accuracy of target perception. Extensive experiments on the RefRT dataset demonstrate the effectiveness of the proposed RTrack framework.
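The abstract names GSPO and CAS without giving their formulas, so the following is only a rough sketch of how such a training signal might be assembled. It assumes the commonly described GSPO form (a length-normalized, sequence-level importance ratio with clipping and group-normalized advantages) and assumes that CAS amounts to bounding the normalized advantage to a fixed range. The function names, the `adv_clip` parameter, and the default `eps_clip` value are hypothetical and are not taken from the paper.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, adv_clip=None, eps=1e-6):
    """Normalize rewards within one sampled group; optionally bound them.

    The bounding step is an ASSUMED reading of Clipped Advantage Scaling.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    adv = [(r - mu) / (sigma + eps) for r in rewards]
    if adv_clip is not None:
        adv = [max(-adv_clip, min(adv_clip, a)) for a in adv]
    return adv


def sequence_ratio(new_logps, old_logps):
    """Length-normalized sequence-level importance ratio (GSPO-style)."""
    n = len(new_logps)
    return math.exp(sum(a - b for a, b in zip(new_logps, old_logps)) / n)


def gspo_objective(new_group, old_group, rewards, eps_clip=0.2, adv_clip=None):
    """Clipped surrogate objective averaged over one group of responses.

    new_group / old_group: per-response lists of token log-probabilities
    under the current and the sampling policy, respectively.
    """
    advs = group_advantages(rewards, adv_clip=adv_clip)
    terms = []
    for new_lp, old_lp, a in zip(new_group, old_group, advs):
        s = sequence_ratio(new_lp, old_lp)
        s_clipped = max(1.0 - eps_clip, min(1.0 + eps_clip, s))
        terms.append(min(s * a, s_clipped * a))
    return mean(terms)


# Hypothetical group of four responses with one clearly better reward.
rewards = [1.0, 0.0, 0.0, 0.0]
clipped_adv = group_advantages(rewards, adv_clip=1.0)
```

In this toy group the single high reward yields a normalized advantage above 1, which the assumed clipping step caps at `adv_clip`; the intent is only to illustrate how bounding advantages limits the size of any single response's contribution to the gradient.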