🤖 AI Summary
The lack of standardized, automated evaluation tools for geospatial code generation hinders rigorous assessment of large language models (LLMs) in this domain.
Method: We propose AutoGEEval, the first multimodal, unit-level automated evaluation framework tailored for Google Earth Engine (GEE). It introduces AutoGEEval-Bench, a dedicated benchmark comprising 1,325 test cases spanning 26 GEE data types. The framework integrates GEE’s Python API, LLMs, multimodal prompt engineering, dynamic execution sandboxing, and fine-grained error classification to enable end-to-end evaluation of natural-language-to-geospatial-code translation.
Contribution/Results: AutoGEEval establishes the first unified evaluation protocol for GEE-based code generation. We systematically assess 18 state-of-the-art LLMs, quantifying disparities across accuracy, computational resource consumption, execution efficiency, and error patterns. Both the benchmark and framework are open-sourced, providing a reproducible, extensible evaluation infrastructure for geospatial AI code generation research.
📝 Abstract
Geospatial code generation is emerging as a key direction in the integration of artificial intelligence and geoscientific analysis. However, standardized tools for automatic evaluation in this domain are still lacking. To address this gap, we propose AutoGEEval, the first multimodal, unit-level automated evaluation framework for geospatial code generation by large language models (LLMs) on the Google Earth Engine (GEE) platform. Built upon the GEE Python API, AutoGEEval establishes a benchmark suite (AutoGEEval-Bench) comprising 1,325 test cases that span 26 GEE data types. The framework integrates question generation and answer verification components to enable an end-to-end automated evaluation pipeline, from function invocation to execution validation. AutoGEEval supports multidimensional quantitative analysis of model outputs in terms of accuracy, resource consumption, execution efficiency, and error types. We evaluate 18 state-of-the-art LLMs, including general-purpose, reasoning-augmented, code-centric, and geoscience-specialized models, and reveal their performance characteristics and potential optimization pathways in GEE code generation. This work provides a unified protocol and foundational resource for the development and assessment of geospatial code generation models, advancing the frontier of automated natural language to domain-specific code translation.
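
To make the idea of unit-level evaluation "from function invocation to execution validation" concrete, the following is a minimal Python sketch, not the AutoGEEval implementation. It assumes each test case supplies a generated function, its arguments, and an expected answer. The names `evaluate_case` and `accuracy` and the outcome labels (`pass`, `wrong_answer`, `syntax_error`, `runtime_error`, `missing_function`) are hypothetical. The sketch compares plain Python values, whereas a real GEE evaluation would have to compare Earth Engine objects through the GEE Python API and run the code in an isolated sandbox.

```python
import time


def evaluate_case(generated_code, func_name, args, expected):
    """Run one unit-level test case and classify the outcome.

    Outcome labels are illustrative, not the paper's taxonomy.
    """
    namespace = {}
    start = time.perf_counter()

    def outcome(status, detail=""):
        # Record the status together with wall-clock execution time.
        elapsed = time.perf_counter() - start
        return {"status": status, "detail": detail, "seconds": elapsed}

    # Step 1: load the generated code so that its function is defined.
    # NOTE: exec on untrusted code is unsafe outside a sandbox.
    try:
        exec(generated_code, namespace)
    except SyntaxError as exc:
        return outcome("syntax_error", str(exc))
    except Exception as exc:
        return outcome("runtime_error", type(exc).__name__)

    # Step 2: check that the expected function exists.
    func = namespace.get(func_name)
    if not callable(func):
        return outcome("missing_function", func_name)

    # Step 3: invoke the function and validate its result.
    try:
        result = func(**args)
    except Exception as exc:
        return outcome("runtime_error", type(exc).__name__)

    return outcome("pass" if result == expected else "wrong_answer")


def accuracy(results):
    """Fraction of test cases whose status is 'pass'."""
    if not results:
        return 0.0
    return sum(r["status"] == "pass" for r in results) / len(results)
```

Aggregating the returned records per model would give the accuracy, execution-time, and error-type breakdowns that the abstract describes. Resource consumption is not measured in this sketch.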