AutoGEEval: A Multimodal and Automated Framework for Geospatial Code Generation on GEE with Large Language Models

📅 2025-05-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: The lack of standardized, automated evaluation tools for geospatial code generation hinders rigorous assessment of large language models (LLMs) in this domain. Method: We propose AutoGEEval, the first multimodal, unit-level automated evaluation framework tailored for Google Earth Engine (GEE). It introduces AutoGEEval-Bench, a dedicated benchmark comprising 1,325 test cases spanning 26 GEE data types. The framework integrates GEE's Python API, LLMs, multimodal prompt engineering, dynamic execution sandboxing, and fine-grained error classification to enable end-to-end evaluation of natural-language-to-geospatial-code translation. Contribution/Results: AutoGEEval establishes the first unified evaluation protocol for GEE-based code generation. We systematically assess 18 state-of-the-art LLMs, quantifying their differences in accuracy, computational resource consumption, execution efficiency, and error patterns. Both the benchmark and framework are open-sourced, providing a reproducible, extensible evaluation infrastructure for geospatial AI code generation research.

📝 Abstract

Geospatial code generation is emerging as a key direction in the integration of artificial intelligence and geoscientific analysis. However, there remains a lack of standardized tools for automatic evaluation in this domain. To address this gap, we propose AutoGEEval, the first multimodal, unit-level automated evaluation framework for geospatial code generation by large language models (LLMs) on the Google Earth Engine (GEE) platform. Built upon the GEE Python API, AutoGEEval establishes a benchmark suite (AutoGEEval-Bench) comprising 1,325 test cases that span 26 GEE data types. The framework integrates both question generation and answer verification components to enable an end-to-end automated evaluation pipeline, from function invocation to execution validation. AutoGEEval supports multidimensional quantitative analysis of model outputs in terms of accuracy, resource consumption, execution efficiency, and error types. We evaluate 18 state-of-the-art LLMs, including general-purpose, reasoning-augmented, code-centric, and geoscience-specialized models, revealing their performance characteristics and potential optimization pathways in GEE code generation. This work provides a unified protocol and foundational resource for the development and assessment of geospatial code generation models, advancing the frontier of automated natural language to domain-specific code translation.
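To make the idea of unit-level evaluation concrete, here is a minimal sketch of how a single test case could be run and scored along the four dimensions the abstract names (accuracy, resource consumption, execution efficiency, error type). It is not the AutoGEEval implementation: the function `evaluate_unit`, the result keys, and the three error labels are hypothetical, the example deliberately avoids calling the GEE API, and plain `exec` is not a sandbox.

```python
import time
import tracemalloc


def evaluate_unit(generated_code, func_name, args, expected):
    """Run one generated function and compare its output with a reference answer.

    Hypothetical helper for illustration only. Returns a dict with pass/fail,
    an error label, wall-clock seconds, and peak traced memory in bytes.
    """
    namespace = {}
    tracemalloc.start()
    start = time.perf_counter()
    try:
        # Compile first so syntax errors are reported separately from runtime errors.
        compiled = compile(generated_code, "<generated>", "exec")
        exec(compiled, namespace)  # NOTE: not sandboxed; assumption for brevity
        actual = namespace[func_name](*args)
        error_type = "none" if actual == expected else "wrong_answer"
    except SyntaxError:
        error_type = "syntax"
    except Exception:
        error_type = "runtime"
    finally:
        seconds = time.perf_counter() - start
        _, peak_bytes = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return {
        "passed": error_type == "none",
        "error_type": error_type,
        "seconds": seconds,
        "peak_bytes": peak_bytes,
    }


# Illustrative test case: a model-generated NDVI helper checked against a known answer.
candidate = "def ndvi(nir, red):\n    return (nir - red) / (nir + red)\n"
result = evaluate_unit(candidate, "ndvi", (3, 1), 0.5)
```

A real harness for GEE would additionally need an authenticated Earth Engine session, isolation of the executed code, and type-aware comparison of outputs such as images or geometries, none of which this sketch attempts.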

Problem

Research questions and friction points this paper is trying to address.

Lack of standardized tools for geospatial code evaluation

Need for automated multimodal assessment of GEE code generation

Performance benchmarking of LLMs in geospatial coding tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal, automated, unit-level evaluation framework for geospatial code generation

Benchmark suite with 1,325 GEE test cases spanning 26 GEE data types

End-to-end evaluation pipeline with LLMs
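As a rough illustration of how results from a benchmark organized by data type could be summarized, the sketch below computes a pass rate per data type from a list of per-test outcomes. The function name `accuracy_by_data_type` and the input layout are assumptions made for this example, and the sample rows are invented placeholders, not results from the paper.

```python
from collections import defaultdict


def accuracy_by_data_type(results):
    """Return {data_type: pass rate} from (data_type, passed) pairs.

    Hypothetical aggregation helper; the pair layout is an assumption.
    """
    totals = defaultdict(lambda: [0, 0])  # data_type -> [passed count, attempted count]
    for data_type, passed in results:
        totals[data_type][0] += int(passed)
        totals[data_type][1] += 1
    return {dt: passed / attempted for dt, (passed, attempted) in totals.items()}


# Placeholder outcomes for illustration only (not data from the benchmark).
sample = [("ee.Image", True), ("ee.Image", False), ("ee.Geometry", True)]
rates = accuracy_by_data_type(sample)
```

Grouping by data type rather than reporting a single overall score makes it easier to see which categories of GEE objects a given model handles poorly.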
