🤖 AI Summary
This work addresses the limitations of existing referring multi-object tracking methods, which rely solely on RGB images and struggle with spatial semantic understanding (e.g., “the person closest to the camera”) and robust target localization under severe occlusion. To overcome these challenges, we introduce depth information and propose a novel referring multi-object tracking task—Depth-aware Referring Multi-Object Tracking (DRMOT)—based on RGB-D-L (RGB, Depth, and Language) multimodal fusion. We construct DRSet, the first benchmark dataset for this task, comprising 187 scenes annotated with 240 referring expressions. Furthermore, we design DRTrack, an MLLM-guided framework that enables depth-aware object localization and trajectory association. Experimental results demonstrate that DRTrack significantly outperforms RGB-only approaches in both spatial semantic comprehension and tracking robustness.
📝 Abstract
Referring Multi-Object Tracking (RMOT) aims to track specific targets based on language descriptions and is vital for interactive AI systems such as robotics and autonomous driving. However, existing RMOT models rely solely on 2D RGB data; lacking explicit 3D spatial information, they struggle to accurately detect and associate targets characterized by complex spatial semantics (e.g., "the person closest to the camera") and to maintain reliable identities under severe occlusion. In this work, we propose a novel task, RGBD Referring Multi-Object Tracking (DRMOT), which explicitly requires models to fuse RGB, Depth (D), and Language (L) modalities to achieve 3D-aware tracking. To advance research on the DRMOT task, we construct a tailored RGBD referring multi-object tracking dataset, named DRSet, designed to evaluate models' spatial-semantic grounding and tracking capabilities. Specifically, DRSet contains RGB images and depth maps from 187 scenes, along with 240 language descriptions, 56 of which incorporate depth-related information. Furthermore, we propose DRTrack, an MLLM-guided depth-referring tracking framework. DRTrack performs depth-aware target grounding from joint RGB-D-L inputs and achieves robust trajectory association by incorporating depth cues. Extensive experiments on the DRSet dataset demonstrate the effectiveness of our framework.