🤖 AI Summary
To address degraded 3D single-object tracking performance caused by sparse and incomplete point clouds in autonomous driving and robotics, this paper proposes a Multimodal-guided Virtual Cues Projection (MVCP) mechanism. MVCP is the first to leverage 2D RGB detection outputs for this task, employing cross-modal feature alignment and differentiable depth completion to synthesize dense, geometrically consistent 3D virtual points, which are seamlessly integrated into a Transformer-based LiDAR point cloud tracking framework. Crucially, MVCP requires no modification to the backbone network and is fully compatible with existing tracking architectures. Evaluated on the nuScenes dataset, the method significantly improves tracking accuracy and robustness under sparse-scene conditions, achieving state-of-the-art performance across multiple metrics. This validates the effectiveness of virtual cues in compensating for geometric deficiencies inherent in real-world LiDAR data.
📝 Abstract
3D single object tracking is essential in autonomous driving and robotics. Existing methods often struggle in sparse and incomplete point cloud scenarios. To address these limitations, we propose a Multimodal-guided Virtual Cues Projection (MVCP) scheme that generates virtual cues to enrich sparse point clouds. Additionally, we introduce an enhanced tracker, MVCTrack, built on the generated virtual cues. Specifically, the MVCP scheme seamlessly integrates RGB sensors into LiDAR-based systems, leveraging a set of 2D detections to create dense 3D virtual cues that substantially alleviate the sparsity of point clouds. These virtual cues integrate naturally with existing LiDAR-based 3D trackers, yielding substantial performance gains. Extensive experiments demonstrate that our method achieves competitive performance on the nuScenes dataset.
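The core idea of lifting 2D detections into dense 3D virtual cues can be sketched as a classic pinhole back-projection: sample pixels inside each 2D detection box, look up their depth in a (completed) depth map, and unproject them through the camera intrinsics. This is a minimal illustrative sketch, not the paper's actual implementation; the function name, sampling grid, and depth-map format are assumptions.

```python
import numpy as np

def lift_detection_to_virtual_points(box_2d, depth, K, samples_per_side=8):
    """Back-project pixels sampled inside a 2D detection box into 3D
    'virtual points' using a dense (completed) depth map and camera
    intrinsics K. Hypothetical sketch of the virtual-cue generation idea.

    box_2d : (u_min, v_min, u_max, v_max) in pixels
    depth  : (H, W) per-pixel depth in metres, assumed already densified
    K      : (3, 3) camera intrinsic matrix
    Returns an (N, 3) array of virtual points in the camera frame.
    """
    u_min, v_min, u_max, v_max = box_2d
    # Regular grid of sample pixels inside the detection box
    us = np.linspace(u_min, u_max, samples_per_side)
    vs = np.linspace(v_min, v_max, samples_per_side)
    uu, vv = np.meshgrid(us, vs)
    uu, vv = uu.ravel(), vv.ravel()
    z = depth[vv.astype(int), uu.astype(int)]
    # Keep only pixels with a valid (positive) depth estimate
    valid = z > 0
    uu, vv, z = uu[valid], vv[valid], z[valid]
    # Pinhole unprojection: x = (u - cx) * z / fx, y = (v - cy) * z / fy
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x = (uu - cx) * z / fx
    y = (vv - cy) * z / fy
    return np.stack([x, y, z], axis=1)

# Toy usage: a flat depth map and one detection box
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
depth = np.full((480, 640), 10.0)  # every pixel 10 m away
pts = lift_detection_to_virtual_points((300, 220, 340, 260), depth, K)
```

The resulting points would then be merged with the raw LiDAR sweep before being fed to the tracker, which is why the scheme needs no backbone changes: the tracker still consumes an ordinary point cloud, just a denser one.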