🤖 AI Summary
To address the redundant or missing information caused by point re-sampling in LiDAR point-cloud-based 3D single object tracking, this paper proposes PillarTrack, a tracking framework built on a pillar-based representation. The method (1) converts sparse point clouds into a dense pillar grid instead of re-sampling points, which preserves local and global geometry; (2) introduces a Pyramid-Encoded Pillar Feature Encoder (PE-PFE) to improve robustness to translation, rotation, and scale variations; and (3) uses an efficient Transformer-based backbone designed around modality differences. Experiments on KITTI and nuScenes show competitive performance and a significant improvement over the baseline.
📝 Abstract
LiDAR-based 3D single object tracking (3D SOT) is a critical task in robotics and autonomous driving. Existing 3D SOT methods typically follow a point-based processing pipeline, in which the re-sampling operation inevitably introduces either redundant or missing information and thereby degrades performance. To address these issues, we propose PillarTrack, a novel pillar-based 3D SOT framework. First, we transform sparse point clouds into dense pillars to preserve local and global geometry. Second, we propose a Pyramid-Encoded Pillar Feature Encoder (PE-PFE) design to make pillar features more robust to translation, rotation, and scale. Third, we present an efficient Transformer-based backbone designed from the perspective of modality differences. Finally, we build PillarTrack on the above designs. Extensive experiments show that our method achieves competitive performance on the KITTI and nuScenes datasets and significantly improves on the baseline.
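The first step of the abstract, turning a sparse point cloud into a pillar grid, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`pillarize`, `pillar_mean_map`), the grid extent, and the pillar size are hypothetical, and a plain per-pillar mean of height stands in for the learned PE-PFE encoder. The point it shows is that every in-range point is assigned to a pillar, so nothing is dropped or duplicated by re-sampling to a fixed point count.

```python
import numpy as np


def pillarize(points, x_range=(0.0, 4.0), y_range=(0.0, 4.0), pillar_size=1.0):
    """Group points of shape (N, 3) into vertical pillars on an x-y grid.

    Returns a dict mapping (ix, iy) grid indices to an array of the points
    that fall in that pillar. Ranges and pillar size are illustrative
    assumptions, not values from the paper.
    """
    points = np.asarray(points, dtype=np.float64)
    # Keep only points inside the grid extent.
    mask = (
        (points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1])
        & (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1])
    )
    kept = points[mask]
    # Integer pillar index along x and y; the z axis is not discretized.
    ix = np.floor((kept[:, 0] - x_range[0]) / pillar_size).astype(np.int64)
    iy = np.floor((kept[:, 1] - y_range[0]) / pillar_size).astype(np.int64)
    pillars = {}
    for i, j, p in zip(ix, iy, kept):
        pillars.setdefault((int(i), int(j)), []).append(p)
    return {key: np.stack(pts) for key, pts in pillars.items()}


def pillar_mean_map(pillars, grid_shape):
    """Scatter the mean height of each pillar into a dense (nx, ny) map.

    Mean pooling is a stand-in for a learned pillar feature encoder.
    """
    bev = np.zeros(grid_shape, dtype=np.float64)
    for (i, j), pts in pillars.items():
        bev[i, j] = pts[:, 2].mean()
    return bev


# Hypothetical toy input: three in-range points and one out-of-range point.
cloud = [[0.5, 0.5, 1.0], [0.6, 0.4, 3.0], [2.5, 3.5, 2.0], [10.0, 10.0, 0.0]]
pillars = pillarize(cloud)
bev = pillar_mean_map(pillars, grid_shape=(4, 4))
```

The dense `bev` array is the kind of regular 2D grid a convolutional or Transformer backbone can consume directly, while empty cells stay at zero.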