🤖 AI Summary
Current AI systems exhibit insufficient reliability in Earth observation tasks, hindering critical applications such as environmental monitoring and disaster management. To diagnose this limitation, we introduce UnivEarth—the first LLM-agent benchmark tailored for Earth observation—comprising 140 true/false questions. Evaluation reveals that state-of-the-art LLM agents fail to correctly invoke the Google Earth Engine (GEE) API in over 58% of cases, achieving only 33% accuracy. To address this, we propose a lightweight adaptation method based on synthetically generated data, using supervised fine-tuning to enhance small language models' (e.g., Llama-3.1-8B) comprehension and execution of the GEE API. Our approach achieves accuracy comparable to large models like DeepSeek-R1 on UnivEarth, while substantially reducing deployment cost and computational overhead. This work establishes a new paradigm for developing trustworthy, efficient, and cost-effective Earth observation AI agents, and we release the benchmark and methods as open-source resources.
📝 Abstract
Earth Observation (EO) provides critical planetary data for environmental monitoring, disaster management, climate science, and other scientific domains. Here we ask: Are AI systems ready for reliable Earth Observation? We introduce UnivEarth, a benchmark of 140 yes/no questions from NASA Earth Observatory articles spanning 13 topics and 17 satellite sensors. Using the Google Earth Engine API as a tool, LLM agents achieve an accuracy of only 33% because the generated code fails to run over 58% of the time. We reduce the failure rate for open models by fine-tuning on synthetic data, allowing much smaller models (Llama-3.1-8B) to achieve accuracy comparable to much larger ones (e.g., DeepSeek-R1). Taken together, our findings identify significant challenges to be solved before AI agents can automate Earth observation, and suggest paths forward. The project page is available at https://iandrover.github.io/UnivEarth.