🤖 AI Summary
To address the challenges of ambiguous inter-frame correspondence due to appearance ambiguity in LiDAR point cloud 3D single-object tracking—and the reliance of motion-based methods on complex segmentation and multi-stage pipelines—this paper proposes a Part-to-Part (P2P) motion modeling paradigm, eliminating conventional appearance matching and explicit segmentation. The approach features dual pathways: an implicit point-level (P2P-point) and an explicit voxel-level (P2P-voxel) architecture. Both enable fine-grained inter-frame motion modeling via point-cloud part alignment, cross-frame feature fusion, and end-to-end differentiable motion cue extraction. Evaluated on KITTI, nuScenes, and Waymo, P2P-voxel achieves roughly 89%, 72%, and 63% precision, respectively; P2P-point outperforms M²Track by +3.3% and +6.7% on KITTI and nuScenes, while operating at 107 FPS.
📝 Abstract
3D single object tracking (SOT) methods based on appearance matching have long suffered from insufficient appearance information caused by incomplete, textureless, and semantically deficient LiDAR point clouds. While the motion paradigm exploits motion cues instead of appearance matching for tracking, it incurs complex multi-stage processing and a segmentation module. In this paper, we first provide in-depth explorations of the motion paradigm, which prove that (**i**) it is feasible to directly infer target relative motion from point clouds across consecutive frames; (**ii**) fine-grained information comparison between consecutive point clouds facilitates target motion modeling. We thereby propose to perform part-to-part motion modeling for consecutive point clouds and introduce a novel tracking framework, termed **P2P**. The framework fuses each corresponding part's information between consecutive point clouds, effectively exploring detailed information changes and thus modeling accurate target-related motion cues. Following this framework, we present the P2P-point and P2P-voxel models, incorporating implicit and explicit part-to-part motion modeling via point- and voxel-based representations, respectively. Without bells and whistles, P2P-voxel sets a new state-of-the-art performance (~**89%**, **72%**, and **63%** precision on KITTI, nuScenes, and the Waymo Open Dataset, respectively). Moreover, under the same point-based representation, P2P-point outperforms the previous motion tracker M²Track by **3.3%** and **6.7%** on KITTI and nuScenes, while running at a considerably high speed of **107 FPS** on a single RTX 3090 GPU. The source code and pre-trained models are available at https://github.com/haooozi/P2P.
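The core idea, fusing each corresponding part's features across two consecutive point clouds rather than matching appearance, can be sketched as follows. This is a hypothetical, heavily simplified illustration using NumPy: the `voxelize` grid resolution, the occupancy/mean-height features, and the concatenation-based fusion are assumptions for illustration only; the paper's actual models use learned point- and voxel-based backbones with a regression head on top of the fused map.

```python
import numpy as np

def voxelize(points, grid_size=8, bounds=(-4.0, 4.0)):
    """Pool a point cloud into a fixed BEV voxel grid; each voxel plays
    the role of a 'part'. Features here are a toy choice: per-voxel
    occupancy and mean height."""
    lo, hi = bounds
    z_sum = np.zeros((grid_size, grid_size))
    counts = np.zeros((grid_size, grid_size))
    # Map x/y coordinates to voxel indices, clipped to the grid.
    idx = np.clip(((points[:, :2] - lo) / (hi - lo) * grid_size).astype(int),
                  0, grid_size - 1)
    for (i, j), z in zip(idx, points[:, 2]):
        z_sum[i, j] += z
        counts[i, j] += 1
    occ = (counts > 0).astype(float)
    mean_z = np.divide(z_sum, counts,
                       out=np.zeros_like(z_sum), where=counts > 0)
    return np.stack([occ, mean_z], axis=-1)  # shape (G, G, 2)

def part_to_part_fuse(prev_pts, curr_pts):
    """Fuse corresponding voxel (part) features of two consecutive frames.
    Because both frames share one grid, channel-wise concatenation exposes
    fine-grained per-part changes; a learned head would then regress the
    target's relative motion from this fused map."""
    f_prev = voxelize(prev_pts)
    f_curr = voxelize(curr_pts)
    return np.concatenate([f_prev, f_curr], axis=-1)  # shape (G, G, 4)
```

The key design point this sketch mirrors is that no explicit segmentation or appearance matching appears anywhere: inter-frame change is captured implicitly by comparing aligned part features, which is what allows the full pipeline to stay single-stage and end-to-end differentiable.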