OmniEarth: A Benchmark for Evaluating Vision-Language Models in Geospatial Tasks

📅 2026-03-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of a systematic evaluation benchmark for vision-language models (VLMs) in real-world Earth observation scenarios by introducing OmniEarth, the first comprehensive VLM benchmark tailored to remote sensing. OmniEarth encompasses 28 fine-grained tasks across three core dimensions: perception, reasoning, and robustness. It supports both multiple-choice and open-ended visual question answering, with multimodal outputs including text, bounding boxes, and masks. To mitigate language bias, the benchmark employs a blind-testing protocol and enforces five semantic consistency constraints. It integrates multi-source remote sensing data (including proprietary Jilin-1 imagery), an expert-validated instruction set, and a rigorous evaluation pipeline, totaling 9,275 images and 44,210 instructions. Experimental results reveal significant limitations of current VLMs on complex geospatial tasks. The benchmark is publicly released to foster standardized progress in the field.

📝 Abstract
Vision-Language Models (VLMs) have demonstrated effective perception and reasoning capabilities on general-domain tasks, leading to growing interest in their application to Earth observation. However, a systematic benchmark for comprehensively evaluating remote sensing vision-language models (RSVLMs) remains lacking. To address this gap, we introduce OmniEarth, a benchmark for evaluating RSVLMs under realistic Earth observation scenarios. OmniEarth organizes tasks along three capability dimensions: perception, reasoning, and robustness. It defines 28 fine-grained tasks covering multi-source sensing data and diverse geospatial contexts. The benchmark supports two task formulations: multiple-choice VQA and open-ended VQA. The latter includes pure text outputs for captioning tasks, bounding box outputs for visual grounding tasks, and mask outputs for segmentation tasks. To reduce linguistic bias and examine whether model predictions rely on visual evidence, OmniEarth adopts a blind test protocol and a quintuple semantic consistency requirement. OmniEarth includes 9,275 carefully quality-controlled images, including proprietary satellite imagery from Jilin-1 (JL-1), along with 44,210 manually verified instructions. We conduct a systematic evaluation of contrastive learning-based models, general closed-source and open-source VLMs, as well as RSVLMs. Results show that existing VLMs still struggle with geospatially complex tasks, revealing clear gaps that need to be addressed for remote sensing applications. OmniEarth is publicly available at https://huggingface.co/datasets/sjeeudd/OmniEarth.
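The abstract notes that open-ended grounding tasks produce bounding-box outputs. The paper's exact scoring protocol is not stated here, but grounding predictions are conventionally scored by intersection-over-union (IoU) against a reference box, often with a 0.5 acceptance threshold. A minimal sketch of that convention (the threshold and box format `(x1, y1, x2, y2)` are common assumptions, not taken from the paper):

```python
def bbox_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (empty if the boxes do not overlap).
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def grounding_correct(pred_box, gt_box, threshold=0.5):
    """Accept a grounding prediction if its IoU clears the threshold."""
    return bbox_iou(pred_box, gt_box) >= threshold
```

For example, a predicted box overlapping the ground truth by one unit square out of a seven-unit union yields IoU 1/7 and would be rejected at the 0.5 threshold.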
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models
Remote Sensing
Benchmark
Geospatial Tasks
Earth Observation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language Models
Remote Sensing
Geospatial Benchmark
Visual Question Answering
Robustness Evaluation
Ronghao Fu
Jilin University
Haoran Liu
Ph.D. Student, Department of Computer Science & Engineering, Texas A&M University
LLMs, Graph/Geometric Learning, AI for Science, Generative Models
Weijie Zhang
University of Kansas Medical Center
Inverse planning, particle therapy
Zhiwen Lin
Jilin University
Xiao Yang
Jilin University
Peng Zhang
Chang Guang Satellite Technology
Bo Yang
Jilin University