🤖 AI Summary
This work addresses the challenge of sparse 4D radar point clouds, whose limited structural information hinders effective environmental perception. To overcome this limitation, the authors propose DRIFT, a model they present as the first to efficiently and jointly model local details and global context in 4D radar perception. DRIFT employs a dual-path Transformer-based architecture that extracts point-level local features and pillar-voxel global representations in parallel, with a multi-stage feature-sharing mechanism that deeply interleaves the two paths. Evaluated on the View-of-Delft (VoD) benchmark and an internal dataset, DRIFT outperforms existing methods, achieving a state-of-the-art mAP of 52.6% for 3D object detection on VoD, substantially surpassing CenterPoint (45.4%).
📝 Abstract
4D radars, which provide 3D point cloud data along with Doppler velocity, are attractive components of modern automated driving systems due to their low cost and robustness under adverse weather conditions. However, they provide significantly lower point cloud density than LiDAR sensors, which makes it important to exploit not only local but also global contextual scene information. This paper proposes DRIFT, a model that effectively captures and fuses both local and global contexts through a dual-path architecture. The model incorporates a point path to aggregate fine-grained local features and a pillar path to encode coarse-grained global features. These two parallel paths are intertwined via novel feature-sharing layers at multiple stages, enabling full utilization of both representations. DRIFT is evaluated on the widely used View-of-Delft (VoD) dataset and a proprietary internal dataset, where it outperforms the baselines on the tasks of object detection and free road estimation. For example, DRIFT achieves a mean average precision (mAP) of 52.6% on the VoD dataset, compared to 45.4% for CenterPoint.
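The dual-path idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the cell size, the scalar per-point features, and the averaging-based "sharing" step are all illustrative stand-ins for the paper's Transformer-based point and pillar paths. The sketch only shows the data flow, in which points are grouped into coarse pillars, each path summarizes the other's view, and a feature-sharing stage fuses local detail with global context.

```python
# Hypothetical minimal sketch of a dual-path point/pillar pipeline with one
# feature-sharing stage. All names and numbers are illustrative, not DRIFT's.
from collections import defaultdict

def pillarize(points, cell=2.0):
    """Group point indices into coarse 2D pillars (the global path's grid)."""
    pillars = defaultdict(list)
    for i, (x, y, *_rest) in enumerate(points):
        pillars[(int(x // cell), int(y // cell))].append(i)
    return dict(pillars)

def share_features(point_feats, pillars):
    """One feature-sharing stage: each pillar summarizes its points (local ->
    global), then each point absorbs its pillar's summary (global -> local)."""
    pillar_feats = {
        key: sum(point_feats[i] for i in idxs) / len(idxs)
        for key, idxs in pillars.items()
    }
    fused = []
    for key, idxs in pillars.items():
        for i in idxs:
            # stand-in for learned fusion: add global context to local feature
            fused.append((i, point_feats[i] + pillar_feats[key]))
    fused.sort()  # restore original point order
    return [f for _, f in fused], pillar_feats

# Toy radar points: (x, y, scalar feature)
points = [(0.5, 0.5, 1.0), (1.0, 1.5, 3.0), (5.0, 5.0, 10.0)]
feats = [p[2] for p in points]
pillars = pillarize(points)
point_out, pillar_out = share_features(feats, pillars)
# The first two points share pillar (0, 0), so each gains their mean (2.0);
# the third point sits alone in pillar (2, 2).
```

In the real model, both paths would apply learned Transformer layers between sharing stages, and this exchange would repeat at multiple depths rather than once.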