MAG-VLAQ: Multi-modal Aerial-Ground Query Aggregation for Cross-View Place Recognition

📅 2026-05-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses cross-view place recognition between ground-level observations and aerial references, which is difficult because of large differences in viewpoint, modality, and spatial structure. The authors propose MAG-VLAQ, a foundation-model-based multi-modal query aggregation framework that jointly uses ground-level RGB images, ground LiDAR point clouds, and aerial imagery. Pre-trained foundation models extract tokens from each modality, and these tokens are projected into a shared embedding space for alignment and fusion. The main contribution is ODE-conditioned VLAQ, which couples neural ordinary differential equation (ODE)-based RGB-LiDAR fusion with vectors of locally aggregated queries (VLAQ): the query centers adapt to the fused multi-modal state, so the global descriptor keeps its globally learned retrieval prototypes while responding to scene-specific visual and geometric cues. The method is evaluated on KITTI360-AG and nuScenes-AG. On KITTI360-AG in the satellite setting it reaches 61.1 Recall@1, compared with 34.5 for the closest competing approach, which is nearly double.
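For readers unfamiliar with the Recall@1 metric quoted here, the following is a minimal sketch of how it can be computed from global descriptors. It assumes each query has exactly one correct reference index (`gt_index`, a hypothetical name); actual benchmarks usually count a retrieval as correct when it falls within a distance threshold of the true location, and the paper's exact protocol is not stated on this page.

```python
import numpy as np

def recall_at_1(query_desc, ref_desc, gt_index):
    """Fraction of queries whose top-1 retrieved reference is the correct one.

    query_desc: (Q, D) ground-query descriptors
    ref_desc:   (R, D) aerial-reference descriptors
    gt_index:   (Q,) index of the correct reference per query (assumption)
    """
    # L2-normalize so the dot product equals cosine similarity.
    q = query_desc / np.linalg.norm(query_desc, axis=1, keepdims=True)
    r = ref_desc / np.linalg.norm(ref_desc, axis=1, keepdims=True)
    # Retrieve the most similar reference for every query.
    top1 = (q @ r.T).argmax(axis=1)
    return float((top1 == np.asarray(gt_index)).mean())
```

A score of 61.1 Recall@1 then means that for 61.1% of ground queries the single best-matching aerial reference was a correct match.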

📝 Abstract

Multi-modal cross-view place recognition remains a fundamental challenge in computer vision and robotics due to the severe viewpoint, modality, and spatial-structure discrepancies between ground observations and aerial references. To address this challenge, we present MAG-VLAQ, a foundation-model-enhanced query aggregation framework for multi-modal aerial-ground cross-view place recognition. Specifically, our approach leverages pre-trained foundation models to extract dense visual tokens from both ground and aerial images, as well as expressive geometric tokens from ground LiDAR observations. These heterogeneous tokens are then projected into a shared embedding space for cross-modal alignment and fusion. As our main contribution, we propose ODE-conditioned VLAQ, which tightly couples neural ordinary differential equations (ODE)-based RGB-LiDAR fusion with vectors of locally aggregated queries (VLAQ). In this design, the VLAQ query centers are dynamically adapted according to the fused multi-modal state. This mechanism allows the final global descriptor to preserve globally learned retrieval prototypes while remaining responsive to scene-specific visual and geometric evidence, significantly improving aerial-ground matching. Extensive experiments on KITTI360-AG and nuScenes-AG validate the effectiveness of our proposed MAG-VLAQ. Notably, on KITTI360-AG, our MAG-VLAQ nearly doubles the state-of-the-art performance, achieving 61.1 Recall@1 in the satellite setting, compared with 34.5 from the closest competing approach.
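The idea of adapting query centers to a fused multi-modal state can be illustrated with a small NumPy sketch. This is not the authors' implementation: the ODE dynamics (`tanh(W s + c)` integrated with fixed-step Euler), the linear center shift, and the VLAD-style residual aggregation are all assumptions chosen for illustration, and every function and parameter name below is hypothetical.

```python
import numpy as np

def euler_fuse(rgb_tokens, lidar_tokens, weight, steps=4, dt=0.25):
    """Hypothetical RGB-LiDAR fusion: integrate ds/dt = tanh(W s + c) with Euler steps."""
    # Context vector c: mean over all RGB and LiDAR tokens (assumption).
    context = np.concatenate([rgb_tokens, lidar_tokens]).mean(axis=0)
    # Initial state: mean RGB token (assumption).
    state = rgb_tokens.mean(axis=0)
    for _ in range(steps):
        state = state + dt * np.tanh(weight @ state + context)
    return state  # (D,)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def conditioned_aggregate(tokens, centers, state, shift_weight):
    """Aggregate (N, D) tokens around K query centers shifted by the fused state."""
    # Adapt the learned centers to the scene: same shift added to every center (assumption).
    adapted = centers + shift_weight @ state                   # (K, D)
    # Soft-assign each token to the adapted centers.
    scores = tokens @ adapted.T / np.sqrt(tokens.shape[1])     # (N, K)
    assign = softmax(scores, axis=1)
    # VLAD-style weighted residuals per center (a stand-in for the paper's aggregation).
    residuals = tokens[:, None, :] - adapted[None, :, :]       # (N, K, D)
    desc = (assign[:, :, None] * residuals).sum(axis=0)        # (K, D)
    # Normalize each center's vector, then the flattened global descriptor.
    desc /= np.linalg.norm(desc, axis=1, keepdims=True) + 1e-12
    flat = desc.ravel()
    return flat / (np.linalg.norm(flat) + 1e-12)

# Usage with random stand-in tokens (D = 8 features, K = 4 centers).
rng = np.random.default_rng(0)
rgb, lidar = rng.normal(size=(16, 8)), rng.normal(size=(12, 8))
weight = 0.1 * rng.normal(size=(8, 8))
fused_state = euler_fuse(rgb, lidar, weight)
descriptor = conditioned_aggregate(
    np.concatenate([rgb, lidar]), rng.normal(size=(4, 8)), fused_state, weight
)
print(descriptor.shape)  # (32,)
```

In this sketch the base `centers` play the role of globally learned retrieval prototypes, while the state-dependent shift is what makes the descriptor responsive to the specific scene. A trained system would learn `weight`, `shift_weight`, and `centers` rather than sampling them randomly.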

Problem

Research questions and friction points this paper is trying to address.

cross-view place recognition

multi-modal

aerial-ground matching

viewpoint discrepancy

modality gap

Innovation

Methods, ideas, or system contributions that make the work stand out.

ODE-conditioned fusion

VLAQ

multi-modal place recognition