🤖 AI Summary
Radar perception suffers from fragmented, task-specific methods with limited cross-task transferability. Method: This paper introduces the first radar-oriented foundation model, proposing a structured spatial-language supervision framework; incorporating hash-aware contrastive learning for fine-grained spatial reasoning; designing localization-aware evaluation metrics that transcend conventional detection paradigms; and integrating structured caption generation, vehicle-distribution encoding in radar coordinates, and CARLA-simulation-driven large-scale synthetic data generation. Contribution/Results: Experiments demonstrate substantial improvements in cross-task generalization and scene-level spatial understanding across diverse driving scenarios. The model establishes a scalable, generalizable, and unified modeling paradigm for radar perception—enabling robust transfer across tasks such as object detection, tracking, and semantic mapping without task-specific retraining.
📝 Abstract
Radar sensors provide reliable perception across adverse weather, lighting, and long-range conditions. Recent advances in foundation models have transformed visual and language understanding, yet their integration with radar sensing remains largely underexplored. Existing radar approaches are fragmented and task-specific; each downstream task employs distinct architectures and training objectives, preventing transfer across tasks. In this work, we introduce RadarFM: a radar foundation model that learns unified scene-level representations through structured spatial language supervision. We make two key contributions: (1) a structured caption framework that encodes vehicle distributions in native radar coordinates, and (2) a hash-aware contrastive learning objective that quantifies continuous scene similarity rather than binary matching, enabling fine-grained spatial reasoning. Leveraging the CARLA simulator, we generate large-scale, well-annotated radar datasets across diverse driving scenarios. We also propose localization-aware metrics that assess spatial accuracy beyond traditional detection measures.