VXP: Voxel-Cross-Pixel Large-scale Image-LiDAR Place Recognition

📅 2024-03-21

🏛️ arXiv.org

📈 Citations: 3

✨ Influential: 1


🤖 AI Summary

This paper addresses cross-modal place recognition in GPS-denied conditions by learning robust global descriptors for camera images and LiDAR point clouds. It introduces a self-supervised local alignment mechanism at the voxel-pixel level: a vision transformer produces compact image features, and a geometric alignment module establishes local correspondences between image features and point-cloud features. Training proceeds in three stages and ends by aggregating the aligned local features into a lightweight shared latent space of global descriptors. On the Oxford RobotCar, ViViD++, and KITTI benchmarks, the method outperforms state-of-the-art cross-modal retrieval by a large margin while keeping inference efficient and computational overhead low. The authors position it as a flexible alternative to GPS across varying environment conditions and sensor setups.


📝 Abstract

Cross-modal place recognition methods are flexible GPS alternatives under varying environment conditions and sensor setups. However, this task is non-trivial since extracting consistent and robust global descriptors from different modalities is challenging. To tackle this issue, we propose Voxel-Cross-Pixel (VXP), a novel camera-to-LiDAR place recognition framework that enforces local similarities in a self-supervised manner and effectively brings global context from images and LiDAR scans into a shared feature space. Specifically, VXP is trained in three stages. First, we deploy a vision transformer to compactly represent input images. Second, we establish local correspondences between image-based and point cloud-based feature spaces using our novel geometric alignment module. Third, we aggregate local similarities into an expressive shared latent space. Extensive experiments on three benchmarks (Oxford RobotCar, ViViD++ and KITTI) demonstrate that our method surpasses state-of-the-art cross-modal retrieval by a large margin. Our evaluations show that the proposed method is accurate, efficient and lightweight. Our project page is available at: https://yunjinli.github.io/projects-vxp/
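Once both modalities live in a shared descriptor space, place recognition reduces to nearest-neighbor retrieval: an image descriptor is the query and LiDAR descriptors form the database. The sketch below illustrates that retrieval step and a Recall@K metric on toy data. It is not the authors' code; the function names (`top_k`, `recall_at_k`), the use of cosine similarity, and the three-dimensional toy descriptors are assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a descriptor to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def top_k(query, database, k):
    """Return indices of the k database descriptors most similar to the query.

    Similarity is cosine similarity (an assumption for this sketch).
    Ties are broken by the lower database index.
    """
    q = l2_normalize(query)
    scored = []
    for idx, entry in enumerate(database):
        d = l2_normalize(entry)
        scored.append((sum(a * b for a, b in zip(q, d)), idx))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [idx for _, idx in scored[:k]]


def recall_at_k(queries, database, positives, k):
    """Fraction of queries with at least one true match among the top k."""
    hits = 0
    for query, true_matches in zip(queries, positives):
        if set(top_k(query, database, k)) & set(true_matches):
            hits += 1
    return hits / len(queries)


# Toy data (hypothetical): two image queries, three LiDAR database entries.
image_descriptors = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2]]
lidar_descriptors = [[0.9, 0.0, 0.1], [0.1, 0.9, 0.0], [0.0, 0.0, 1.0]]
positives = [[0], [1]]  # database indices that show the same place as each query

print(recall_at_k(image_descriptors, lidar_descriptors, positives, k=1))
```

In practice the database would hold one descriptor per LiDAR scan, and a match would usually be defined by a distance threshold between the ground-truth poses of query and database entry.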

Problem

Research questions and friction points this paper is trying to address.

Cross-modal place recognition under varying conditions

Extracting consistent global descriptors from different modalities

Enhancing camera-to-LiDAR place recognition accuracy and efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised local similarity enforcement

Geometric alignment module for cross-modal features

Shared latent space aggregation for global context
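One way to picture local voxel-to-pixel correspondence is to project voxel centers into the image with a pinhole camera model and compare the voxel feature with the image feature at the projected pixel. The sketch below follows that idea under stated assumptions: voxel centers are already expressed in the camera frame, the intrinsics (`fx`, `fy`, `cx`, `cy`) are known, and the loss is a plain mean squared distance. It is an illustration of the general technique, not the paper's geometric alignment module, and all names are hypothetical.

```python
import math


def project_voxels(voxel_centers, fx, fy, cx, cy, width, height):
    """Project 3D voxel centers (camera frame) onto the image plane.

    Returns (voxel_index, u, v) for every voxel that lies in front of the
    camera and lands inside the image bounds.
    """
    matches = []
    for index, (x, y, z) in enumerate(voxel_centers):
        if z <= 0:  # behind the camera, no valid projection
            continue
        u = math.floor(fx * x / z + cx)
        v = math.floor(fy * y / z + cy)
        if 0 <= u < width and 0 <= v < height:
            matches.append((index, u, v))
    return matches


def local_alignment_loss(voxel_features, image_features, matches):
    """Mean squared L2 distance between matched voxel and pixel features.

    image_features is indexed as image_features[v][u] (row, then column).
    """
    if not matches:
        return 0.0
    total = 0.0
    for index, u, v in matches:
        voxel_vec = voxel_features[index]
        pixel_vec = image_features[v][u]
        total += sum((a - b) ** 2 for a, b in zip(voxel_vec, pixel_vec))
    return total / len(matches)


# Toy setup (hypothetical): a 4x4 feature map with 2-dimensional features.
width = height = 4
image_features = [[[0.0, 0.0] for _ in range(width)] for _ in range(height)]
image_features[2][2] = [1.0, 0.0]
image_features[1][3] = [0.0, 1.0]

voxel_centers = [
    (0.0, 0.0, 1.0),   # projects to the pixel (u=2, v=2)
    (0.5, -0.5, 2.0),  # projects to the pixel (u=3, v=1)
    (0.0, 0.0, -1.0),  # behind the camera, skipped
    (2.0, 0.0, 1.0),   # projects outside the image, skipped
]
voxel_features = [[1.0, 0.0], [0.0, 0.0], [0.5, 0.5], [0.5, 0.5]]

matches = project_voxels(voxel_centers, 4.0, 4.0, 2.0, 2.0, width, height)
loss = local_alignment_loss(voxel_features, image_features, matches)
print(matches, loss)
```

Because the correspondences come from geometry alone, a loss of this kind needs no place labels, which is consistent with the self-supervised framing above.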
