🤖 AI Summary
This work addresses the weak cross-modal generalization and architectural complexity in natural language-guided cross-view geolocalization (NGCG) by proposing a novel paradigm that directly leverages multimodal large language models (MLLMs) as retrievers without architectural redesign. Through parameter-efficient fine-tuning, the approach preserves pretrained knowledge while optimizing latent representations to achieve strong image-text alignment. A systematic analysis of the model backbone and feature aggregation strategies effectively unlocks the inherent retrieval capabilities of MLLMs. Evaluated on GeoText-1652, the method improves Text-to-Image Recall@1 by 12.2% and achieves top performance on five out of twelve subtasks in CVG-Text, surpassing existing baselines with significantly fewer trainable parameters. This study thus introduces a concise, generalizable, and scalable framework for semantic cross-view retrieval.
📝 Abstract
Natural-language Guided Cross-view Geo-localization (NGCG) aims to retrieve geo-tagged satellite imagery using textual descriptions of ground scenes. Recent NGCG methods commonly rely on CLIP-style dual-encoder architectures, which often suffer from weak cross-modal generalization and require complex architectural designs. In contrast, Multimodal Large Language Models (MLLMs) offer powerful semantic reasoning capabilities but are not directly optimized for retrieval tasks. In this work, we present a simple yet effective framework that adapts MLLMs to NGCG via parameter-efficient fine-tuning. Our approach optimizes latent representations within the MLLM while preserving its pretrained multimodal knowledge, enabling strong cross-modal alignment without redesigning the model architecture. Through a systematic analysis of design variables, from model backbone to feature aggregation, we provide practical and generalizable insights for leveraging MLLMs in NGCG. Our method achieves state-of-the-art results on GeoText-1652, with a 12.2% improvement in Text-to-Image Recall@1, and secures top performance on 5 out of 12 subtasks of CVG-Text, while using far fewer trainable parameters than the baselines it surpasses. These results position MLLMs as a robust foundation for semantic cross-view retrieval and pave the way for MLLM-based NGCG as a scalable, powerful alternative to traditional dual-encoder designs. Project page and code are available at https://yuqichen888.github.io/NGCG-MLLMs-web/.
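To make the terms "feature aggregation" and "Text-to-Image Recall@1" concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not say which aggregation strategies are compared, so mean pooling and last-token pooling are assumed here as two common ways to collapse per-token hidden states into a single embedding. All function names (`mean_pool`, `last_token_pool`, `cosine`, `recall_at_k`) are hypothetical, and cosine similarity is an assumed ranking score.

```python
import math


def mean_pool(token_states, mask):
    """Average the hidden states of non-padding tokens (assumed strategy)."""
    kept = [vec for vec, m in zip(token_states, mask) if m]
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


def last_token_pool(token_states, mask):
    """Use the hidden state of the final non-padding token (assumed strategy)."""
    last = max(i for i, m in enumerate(mask) if m)
    return list(token_states[last])


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def recall_at_k(text_embs, image_embs, gt_index, k=1):
    """Fraction of text queries whose ground-truth image ranks in the top k."""
    hits = 0
    for q, emb in enumerate(text_embs):
        # Rank all gallery images by similarity to this text query.
        ranked = sorted(
            range(len(image_embs)),
            key=lambda i: cosine(emb, image_embs[i]),
            reverse=True,
        )
        if gt_index[q] in ranked[:k]:
            hits += 1
    return hits / len(text_embs)


if __name__ == "__main__":
    # Toy hidden states: three tokens, the last one is padding.
    states = [[1.0, 0.0], [3.0, 2.0], [9.0, 9.0]]
    mask = [1, 1, 0]
    print(mean_pool(states, mask), last_token_pool(states, mask))

    # Toy gallery: each text query matches the image with the same index.
    texts = [[1.0, 0.0], [0.0, 1.0]]
    images = [[0.9, 0.1], [0.1, 0.9]]
    print(recall_at_k(texts, images, gt_index=[0, 1], k=1))
```

In an actual setup the token states would come from the fine-tuned MLLM for both the text query and the satellite image, and Recall@1 would be computed over the full test gallery rather than a toy list.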