🤖 AI Summary
This work addresses temporal inconsistency and reconstruction distortion arising from cross-modal fusion (e.g., radar/sonar → LiDAR) in LiDAR-based SLAM by proposing a temporally aware multimodal fusion framework. The method incorporates a temporal embedding alignment module, a motion-aligned loss function, and a windowed temporal fusion mechanism, and it is integrated with Cartographer atop LiDAR-BIND to enable end-to-end SLAM optimization. A further contribution is a set of temporal quality metrics, including ones based on FVMD, that serve as practical indicators of temporal quality, while the framework retains plug-and-play modality fusion. Experiments demonstrate improvements in pose estimation stability and occupancy map accuracy: the absolute trajectory error (ATE) is reduced by 21.3% on average. Cross-modal reconstruction also achieves better spatiotemporal consistency and geometric fidelity than state-of-the-art baselines.
📝 Abstract
This paper extends LiDAR-BIND, a modular multi-modal fusion framework that binds heterogeneous sensors (radar, sonar) to a LiDAR-defined latent space, with mechanisms that explicitly enforce temporal consistency. We introduce three contributions: (i) temporal embedding similarity that aligns consecutive latents, (ii) a motion-aligned transformation loss that matches displacement between predictions and ground-truth LiDAR, and (iii) windowed temporal fusion using a specialised temporal module. We further update the model architecture to better preserve spatial structure. Evaluations on radar/sonar-to-LiDAR translation demonstrate improved temporal and spatial coherence, yielding lower absolute trajectory error and better occupancy map accuracy in Cartographer-based SLAM (Simultaneous Localisation and Mapping). We also propose metrics based on the Fréchet Video Motion Distance (FVMD) and a correlation-peak distance metric, which provide practical temporal quality indicators for evaluating SLAM performance. The proposed temporal LiDAR-BIND, or LiDAR-BIND-T, maintains plug-and-play modality fusion while substantially enhancing temporal stability, resulting in improved robustness and performance for downstream SLAM.
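To make contribution (i) more concrete, the sketch below shows one way a temporal embedding similarity term could be computed over a sequence of latent vectors. The abstract does not give the formulation, so everything here is an assumption for illustration: the use of cosine similarity, the mean over consecutive pairs, and the names `cosine_similarity` and `temporal_similarity_loss` are hypothetical and are not taken from LiDAR-BIND-T.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def temporal_similarity_loss(latents):
    """Mean of (1 - cosine similarity) over consecutive latent pairs.

    Assumed formulation: a low value means consecutive latents point in
    similar directions, i.e. the sequence is temporally smooth.
    """
    if len(latents) < 2:
        return 0.0
    gaps = [
        1.0 - cosine_similarity(latents[t], latents[t + 1])
        for t in range(len(latents) - 1)
    ]
    return sum(gaps) / len(gaps)


# Toy sequences of 2-D latents (hypothetical data, not from the paper).
smooth = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
jittery = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]

print(temporal_similarity_loss(smooth))
print(temporal_similarity_loss(jittery))
```

In this toy setting the smooth sequence scores lower than the jittery one, which is the behaviour a temporal consistency term is meant to reward. A real implementation would operate on batched tensors inside the training loop and might compare the predicted latents' temporal similarity against that of the LiDAR latents rather than penalising all change.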