AddressVLM: Cross-view Alignment Tuning for Image Address Localization using Large Vision-Language Models

📅 2025-08-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing large vision-language models (LVLMs) perform well on city-level coarse-grained geolocation but struggle with fine-grained street-level address localization from street-view imagery. To address this, we propose a cross-view alignment fine-tuning framework that leverages perspective-invariant satellite imagery as macroscopic contextual cues. Our method integrates street-view and satellite image grafting, automatic label generation, and a two-stage training pipeline (first optimizing cross-view feature alignment, then refining address prediction) to enhance the model's global spatial understanding of street layouts. Evaluated on the Pittsburgh and San Francisco street-view benchmarks, our approach improves average address localization accuracy by 9% and 12%, respectively, over state-of-the-art LVLMs. This work advances fine-grained, queryable visual geolocation research by bridging semantic and geometric disparities across heterogeneous visual perspectives.

📝 Abstract
Large visual language models (LVLMs) have demonstrated impressive performance in coarse-grained geo-localization at the country or city level, but they struggle with fine-grained street-level localization within urban areas. In this paper, we explore integrating city-wide address localization capabilities into LVLMs, facilitating flexible address-related question answering using street-view images. A key challenge is that street-view visual question answering (VQA) data provides only microscopic visual cues, leading to subpar performance in fine-tuned models. To tackle this issue, we incorporate perspective-invariant satellite images as macro cues and propose cross-view alignment tuning, which includes a satellite-view and street-view image grafting mechanism along with an automatic label generation mechanism. The LVLM's global understanding of street distribution is then enhanced through cross-view matching. Our proposed model, named AddressVLM, is trained with a two-stage protocol: cross-view alignment tuning followed by address localization tuning. Furthermore, we have constructed two street-view VQA datasets based on image address localization datasets from Pittsburgh and San Francisco. Qualitative and quantitative evaluations demonstrate that AddressVLM outperforms counterpart LVLMs by over 9% and 12% in average address localization accuracy on these two datasets, respectively.
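The abstract's "image grafting mechanism" composes a street-view image and a satellite patch into a single input so the model sees micro and macro cues together. The sketch below is a minimal illustration under the assumption that grafting is a simple side-by-side composition; the paper's actual mechanism may differ, and `graft_views` and the image shapes are hypothetical.

```python
import numpy as np

def graft_views(street_img, sat_img):
    """Compose a street-view crop and a satellite patch into one canvas.

    Assumption (not from the paper): the two views share a height and are
    placed side by side, so a single vision encoder processes both the
    microscopic street-level cues and the macroscopic satellite context.
    """
    assert street_img.shape[0] == sat_img.shape[0], "equal heights assumed"
    return np.concatenate([street_img, sat_img], axis=1)

street = np.zeros((224, 448, 3), dtype=np.uint8)   # dummy street-view crop
sat = np.full((224, 224, 3), 255, dtype=np.uint8)  # dummy satellite patch
grafted = graft_views(street, sat)                 # (224, 672, 3) canvas
```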
Problem

Research questions and friction points this paper is trying to address.

Enhancing fine-grained street-level localization in urban areas using LVLMs
Integrating satellite and street-view images for cross-view alignment tuning
Improving address localization accuracy with two-stage training protocols
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-view alignment tuning for LVLMs
Satellite and street-view image grafting
Automatic label generation mechanism
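Cross-view alignment of this kind is commonly realized as a symmetric contrastive (InfoNCE-style) objective between paired street-view and satellite embeddings. The sketch below is an assumption for illustration, not the paper's actual loss; `info_nce_cross_view`, the temperature value, and the embedding shapes are all hypothetical.

```python
import numpy as np

def info_nce_cross_view(street_feats, sat_feats, temperature=0.07):
    """Symmetric InfoNCE loss aligning street-view and satellite embeddings.

    street_feats, sat_feats: (N, D) arrays where row i of each is a
    matched street/satellite pair; mismatched rows act as negatives.
    """
    # L2-normalize so the dot product is cosine similarity.
    s = street_feats / np.linalg.norm(street_feats, axis=1, keepdims=True)
    t = sat_feats / np.linalg.norm(sat_feats, axis=1, keepdims=True)

    logits = s @ t.T / temperature   # (N, N): matched pairs on the diagonal
    labels = np.arange(len(s))

    def cross_entropy(lg, lb):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(lb)), lb].mean()

    # Symmetric: street->satellite and satellite->street retrieval.
    return 0.5 * (cross_entropy(logits, labels)
                  + cross_entropy(logits.T, labels))

rng = np.random.default_rng(0)
street = rng.normal(size=(4, 8))
aligned_loss = info_nce_cross_view(street, street)  # identical views: near-zero loss
```

Minimizing this objective pulls each street-view embedding toward its matching satellite patch and pushes it away from the other patches in the batch, which is one plausible way to instill the "global understanding of street distribution" the abstract describes.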
Shixiong Xu
State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA, Zhongguancun East Road, Beijing, 100190, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, 100190, China
Chenghao Zhang
Renmin University of China
Natural Language Processing, Information Retrieval, Multimodal
Lubin Fan
Alibaba Cloud
Computer Graphics, Computer Vision, MLLM
Yuan Zhou
State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA, Zhongguancun East Road, Beijing, 100190, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, 100190, China
Bin Fan
School of Intelligence Science and Technology, University of Science and Technology Beijing, Beijing, 100190, China
Shiming Xiang
National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
Distance Metric Learning, Semi-supervised Learning, Manifold Learning, Regression, Feature Selection
Gaofeng Meng
State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA, Zhongguancun East Road, Beijing, 100190, China; CAIR, HK Institute of Science & Innovation, Chinese Academy of Sciences, Hong Kong, China
Jieping Ye
Alibaba Cloud, Beijing, 100020, China