🤖 AI Summary
Existing geospatial question-answering benchmarks are largely confined to static retrieval tasks and are therefore insufficient for evaluating large language models' capabilities in dynamic, multi-objective geographic exploration. This work proposes EVGeoQA, a novel benchmark that anchors each query to the user's real-time location and imposes dual constraints (a charging need and an activity preference) to assess dynamic spatial reasoning. We further introduce GeoRover, a tool-augmented agent framework that uses real-world geographic data to systematically evaluate models' spatial planning and reasoning abilities. Experimental results show that while large language models can complete sub-tasks with tool assistance, their performance degrades on long-range spatial exploration. Notably, these models exhibit an emergent capability: they summarize historical exploration trajectories to improve exploration efficiency.
📝 Abstract
While Large Language Models (LLMs) demonstrate remarkable reasoning capabilities, their potential for purpose-driven exploration in dynamic geo-spatial environments remains under-investigated. Existing Geo-Spatial Question Answering (GSQA) benchmarks predominantly focus on static retrieval, failing to capture the complexity of real-world planning that involves dynamic user locations and compound constraints. To bridge this gap, we introduce EVGeoQA, a novel benchmark built upon Electric Vehicle (EV) charging scenarios that features a distinct location-anchored and dual-objective design. Specifically, each query in EVGeoQA is explicitly bound to a user's real-time coordinates and integrates two objectives: a charging necessity and a co-located activity preference. To systematically assess models in such complex settings, we further propose GeoRover, a general evaluation framework based on a tool-augmented agent architecture that measures LLMs' capacity for dynamic, multi-objective exploration. Our experiments reveal that while LLMs successfully utilize tools to address sub-tasks, they struggle with long-range spatial exploration. Notably, we observe an emergent capability: LLMs can summarize historical exploration trajectories to enhance exploration efficiency. These findings establish EVGeoQA as a challenging testbed for future geo-spatial intelligence. The dataset and prompts are available at https://github.com/Hapluckyy/EVGeoQA/.
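To make the location-anchored, dual-objective design concrete, the following is a minimal sketch of how such a query might be represented and answered against a toy station list. It is not the EVGeoQA schema or the GeoRover tool interface: the `Station` and `Query` classes, the `candidate_stations` helper, the search radius, and all coordinates are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


@dataclass(frozen=True)
class Station:
    """Hypothetical charging station with activity types found nearby."""
    name: str
    lat: float
    lon: float
    nearby_activities: frozenset


@dataclass(frozen=True)
class Query:
    """Hypothetical query: a user location plus a charging need and an activity preference."""
    user_lat: float
    user_lon: float
    activity: str   # co-located activity preference, e.g. "cinema"
    max_km: float   # assumed search radius around the user


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def candidate_stations(query, stations):
    """Return stations that satisfy both objectives, nearest first."""
    hits = []
    for s in stations:
        d = haversine_km(query.user_lat, query.user_lon, s.lat, s.lon)
        # Objective 1: reachable charging; objective 2: preferred activity nearby.
        if d <= query.max_km and query.activity in s.nearby_activities:
            hits.append((d, s))
    hits.sort(key=lambda pair: pair[0])
    return [s for _, s in hits]


# Made-up data: A is closest but lacks a cinema, C has one but is too far.
STATIONS = [
    Station("A", 30.00, 120.00, frozenset({"cafe"})),
    Station("B", 30.01, 120.00, frozenset({"cinema", "cafe"})),
    Station("C", 30.50, 120.00, frozenset({"cinema"})),
]
QUERY = Query(user_lat=30.00, user_lon=120.00, activity="cinema", max_km=5.0)
RESULT = [s.name for s in candidate_stations(QUERY, STATIONS)]
```

In this toy setup only station B meets both constraints. A static retrieval benchmark would not need the user's coordinates at all, whereas here the same question yields a different answer as soon as the anchor location moves.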