🤖 AI Summary
This work addresses the degradation of cameras and LiDAR in adverse weather such as rain, fog, and snow by proposing a weather-robust semantic perception method based on 4D millimeter-wave radar. The approach aligns a radar encoder with frozen SigLIP vision embeddings and decodes structured scene captions through a frozen vision-language model (VLM), with approximately 7 million trainable parameters. The authors identify a token-norm mismatch between radar features and the frozen VLM as the dominant failure mode and resolve it by applying LayerNorm to the projector output. On held-out fog, light snow, and heavy snow sequences from the K-RADAR dataset, all radar configurations outperform a camera baseline that collapses to a hallucination rate above 90%. An analysis of encoder complexity, caption format, and pooling strategy reveals tradeoffs that inform future radar-VLM pipeline design.
📝 Abstract
Cameras and LiDAR degrade in rain, fog, and snow, while millimeter-wave radar remains largely unaffected. We align a radar encoder to frozen SigLIP vision embeddings and decode structured scene captions through a frozen vision-language model (VLM), training only approximately 7M parameters. On K-RADAR with held-out fog, light snow, and heavy snow sequences, all radar configurations outperform a camera baseline that collapses to over 90% hallucination. We identify a token-norm mismatch as the dominant failure mode when bridging radar to a frozen VLM and show that projector-output LayerNorm resolves it. Analysis of encoder complexity, caption format, and pooling strategy reveals tradeoffs that inform future radar-VLM pipeline design.
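
To make the token-norm mismatch concrete, the following is a minimal pure-Python sketch and not the authors' implementation. It assumes a single linear projector with random weights, hypothetical dimensions (`d_in = 16`, `d_out = 32`), and synthetic radar tokens with a large feature scale. The abstract does not give the projector architecture, the embedding width, or the norm range the frozen VLM expects, so every number here is illustrative. The sketch only shows that applying LayerNorm to the projector output gives every token the same L2 norm (about the square root of `d_out` when the gain is 1 and the bias is 0), whatever the scale of the incoming radar features.

```python
import math
import random


def l2_norm(vec):
    """Euclidean norm of one token vector."""
    return math.sqrt(sum(x * x for x in vec))


def project(token, weight):
    """Linear projector (hypothetical): weight is a list of output rows."""
    return [sum(w * x for w, x in zip(row, token)) for row in weight]


def layer_norm(vec, gamma=1.0, beta=0.0, eps=1e-5):
    """LayerNorm over the feature dimension of a single token.

    gamma and beta are scalars here for brevity; a trained layer would
    normally learn one gain and one bias per feature.
    """
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [gamma * (x - mean) * inv_std + beta for x in vec]


random.seed(0)
d_in, d_out = 16, 32  # assumed sizes, chosen only for illustration

# Random projector weights and synthetic "radar tokens" with a large scale.
weight = [[random.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(d_out)]
radar_tokens = [[random.gauss(0.0, 5.0) for _ in range(d_in)] for _ in range(4)]

raw = [project(t, weight) for t in radar_tokens]   # projector output as-is
normed = [layer_norm(t) for t in raw]              # projector-output LayerNorm

raw_norms = [l2_norm(t) for t in raw]
normed_norms = [l2_norm(t) for t in normed]

print("raw token norms:   ", [round(n, 2) for n in raw_norms])
print("normed token norms:", [round(n, 2) for n in normed_norms])
print("sqrt(d_out):       ", round(math.sqrt(d_out), 2))
```

In a trained pipeline, the learned LayerNorm gain would let the projector settle on whatever token scale the frozen VLM tolerates. Here the gain is fixed at 1 to keep the effect easy to read.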