Skill-Conditioned Visual Geolocation for Vision-Language

📅 2026-04-10

🤖 AI Summary

Existing vision-language models for image geolocation lack structured geographic reasoning and any mechanism for continuous evolution; they rely on static implicit memory and non-iterative inference. This work proposes GeoSkill, a framework that introduces, for the first time, a training-free, evolvable skill graph expressed in natural language to encode geographic reasoning skills. By combining multi-turn reasoning with backtracking and a skill synthesis-pruning algorithm, GeoSkill autonomously distills and refines skills from both successful and failed reasoning trajectories. Evaluated on the GeoRC benchmark and multiple external datasets, GeoSkill significantly improves geolocation accuracy and reasoning faithfulness, while demonstrating verifiable emergent geographic reasoning abilities and strong generalization across diverse settings.
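
The summary does not say how geolocation accuracy is scored. A common convention in image geolocation work is to report the fraction of predictions that land within some great-circle distance of the true coordinates, and the sketch below illustrates that convention only. The function names, the Earth radius constant, and the thresholds in the usage note are assumptions for illustration, not details taken from GeoSkill or GeoRC.

```python
import math

# Mean Earth radius in kilometres (assumed constant for this sketch).
EARTH_RADIUS_KM = 6371.0088


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def accuracy_within(preds, truths, threshold_km):
    """Fraction of predictions within threshold_km of the ground truth."""
    if len(preds) != len(truths) or not truths:
        raise ValueError("preds and truths must be non-empty and equal length")
    hits = sum(
        haversine_km(*pred, *truth) <= threshold_km
        for pred, truth in zip(preds, truths)
    )
    return hits / len(truths)
```

For example, `accuracy_within(preds, truths, 25)` would give a city-scale score and `accuracy_within(preds, truths, 750)` a country-scale score; both thresholds are placeholders, not values reported for GeoSkill.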

📝 Abstract

Vision-language models (VLMs) have shown a promising ability in image geolocation, but they still lack structured geographic reasoning and the capacity for autonomous self-evolution. Existing methods predominantly rely on implicit parametric memory, which often exploits outdated knowledge and generates hallucinated reasoning. Furthermore, current inference is a "one-off" process, lacking the feedback loops necessary for self-evolution based on reasoning outcomes. To address these issues, we propose GeoSkill, a training-free framework based on an evolving Skill-Graph. We first initialize the graph by refining human expert trajectories into atomic, natural-language skills. For execution, GeoSkill employs an inference model to perform direct reasoning guided by the current Skill-Graph. For continuous growth, an Autonomous Evolution mechanism leverages a larger model to conduct multiple reasoning rollouts on image-coordinate pairs sourced from web-scale data and verified real-world reasoning. By analyzing both successful and failed trajectories from these rollouts, the mechanism iteratively synthesizes and prunes skills, effectively expanding the Skill-Graph and correcting geographic biases without any parameter updates. Experiments demonstrate that GeoSkill achieves promising performance in both geolocation accuracy and reasoning faithfulness on GeoRC, while maintaining superior generalization across diverse external datasets. Furthermore, our autonomous evolution fosters the emergence of novel, verifiable skills, significantly enhancing the system's cognition of real-world geographic knowledge beyond isolated case studies.
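
The synthesize-and-prune loop described in the abstract can be pictured with a small sketch. This is not the authors' implementation: the `Skill` and `SkillGraph` classes, the rollout dictionary format, the counting-based pruning rule, and the `min_uses` and `min_rate` thresholds are all hypothetical. The rollouts, which in GeoSkill come from a larger model, are replaced here by hand-written stand-ins.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    """One atomic natural-language skill (field names are hypothetical)."""
    text: str
    successes: int = 0
    failures: int = 0

    @property
    def uses(self):
        return self.successes + self.failures

    @property
    def success_rate(self):
        return self.successes / self.uses if self.uses else 0.0


@dataclass
class SkillGraph:
    """Skills keyed by name, plus directed (parent, child) edges."""
    skills: dict = field(default_factory=dict)
    edges: set = field(default_factory=set)

    def add_skill(self, name, text, parent=None):
        # Synthesis step: add a new skill, optionally linked to a parent.
        self.skills.setdefault(name, Skill(text))
        if parent is not None and parent in self.skills:
            self.edges.add((parent, name))

    def record(self, used, success):
        # Credit or blame every known skill that a rollout relied on.
        for name in used:
            skill = self.skills.get(name)
            if skill is None:
                continue
            if success:
                skill.successes += 1
            else:
                skill.failures += 1

    def prune(self, min_uses=3, min_rate=0.34):
        # Assumed rule: drop skills tried often enough that mostly failed.
        doomed = {
            name for name, s in self.skills.items()
            if s.uses >= min_uses and s.success_rate < min_rate
        }
        for name in doomed:
            del self.skills[name]
        self.edges = {
            (a, b) for a, b in self.edges
            if a not in doomed and b not in doomed
        }
        return doomed


def evolve(graph, rollouts, min_uses=3, min_rate=0.34):
    """One evolution pass: score rollouts, add proposals, then prune."""
    for rollout in rollouts:
        graph.record(rollout["used"], rollout["success"])
        proposal = rollout.get("proposal")  # (name, text, parent) or None
        if proposal:
            graph.add_skill(*proposal)
    return graph.prune(min_uses, min_rate)


# Illustrative skills and rollouts, invented for this example.
graph = SkillGraph()
graph.add_skill("driving_side", "Check which side of the road traffic uses.")
graph.add_skill("sun_position", "Guess the hemisphere from shadows alone.")

rollouts = [
    {"used": ["driving_side"], "success": True},
    {"used": ["sun_position"], "success": False},
    {"used": ["driving_side", "sun_position"], "success": False,
     "proposal": ("bollard_style",
                  "Compare roadside bollards with known national designs.",
                  "driving_side")},
    {"used": ["sun_position"], "success": False},
    {"used": ["driving_side"], "success": True},
]
removed = evolve(graph, rollouts)
```

In this toy run the skill that failed in every rollout is pruned, while the newly proposed skill is kept because it has not yet been tried often enough to judge. No model parameters are touched, which is the sense in which the loop is training-free; how GeoSkill actually scores, synthesizes, and prunes skills is not specified in the abstract.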

Problem

Research questions and friction points this paper is trying to address.

visual geolocation

vision-language models

geographic reasoning

self-evolution

hallucinated reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Skill-Graph

Autonomous Evolution

Visual Geolocation