EarthVL: A Progressive Earth Vision-Language Understanding and Generation Framework

📅 2026-01-06
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limited relational reasoning capability of existing Earth vision methods, which hinders comprehensive scene understanding despite progress in geospatial object recognition. To advance holistic interpretation for urban planning applications, the study introduces a multitask framework featuring EarthVLSet—a multimodal Earth vision dataset integrating images, masks, and textual annotations—and proposes EarthVLNet, a semantic-guided network that progressively fuses remote sensing imagery and language through semantic segmentation, object-aware large language modeling, and visual-language question answering. A novel numerical difference loss enables dynamic cross-task optimization, yielding state-of-the-art performance across three benchmarks: semantic segmentation, multiple-choice, and open-ended visual question answering. The findings further demonstrate that segmentation features enhance cross-dataset VQA performance, offering a promising direction toward integrative Earth vision understanding.

📝 Abstract
Earth vision has achieved milestones in geospatial object recognition but lacks exploration in object-relational reasoning, limiting comprehensive scene understanding. To address this, a progressive Earth vision-language understanding and generation framework is proposed, including a multi-task dataset (EarthVLSet) and a semantic-guided network (EarthVLNet). Focusing on city planning applications, EarthVLSet includes 10.9k sub-meter resolution remote sensing images, land-cover masks, and 761.5k textual pairs covering both multiple-choice and open-ended visual question answering (VQA) tasks. In an object-centric way, EarthVLNet progressively achieves semantic segmentation, relational reasoning, and comprehensive understanding. The first stage performs land-cover segmentation to generate object semantics for VQA guidance. Guided by pixel-wise semantics, the object-awareness-based large language model (LLM) performs relational reasoning and knowledge summarization to generate the required answers. For optimization, the numerical difference loss is proposed to dynamically add difference penalties, accommodating the varied statistics of different objects. Three benchmarks, covering semantic segmentation, multiple-choice, and open-ended VQA, demonstrate the superiority of EarthVLNet and yield three future directions: 1) segmentation features consistently enhance VQA performance even in cross-dataset scenarios; 2) multiple-choice tasks show greater sensitivity to the vision encoder than to the language decoder; and 3) open-ended tasks necessitate advanced vision encoders and language decoders for optimal performance. We believe this dataset and method will provide a beneficial benchmark that connects "image-mask-text", advancing geographical applications for Earth vision.
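The abstract describes the numerical difference loss only at a high level (dynamically added difference penalties that account for varied per-object statistics). One plausible reading is a per-class discrepancy penalty whose weights adapt to each class's relative error; the sketch below illustrates that idea. The function name, the softmax weighting scheme, and the choice of per-class statistics are all assumptions for illustration, not the paper's implementation:

```python
import math

def numerical_difference_loss(pred_stats, true_stats, eps=1e-6):
    """Hypothetical sketch of a dynamically weighted difference penalty.

    pred_stats, true_stats: lists of rows, one row per sample, each row
    holding per-class numerical statistics (e.g. object counts or areas)
    predicted by the model vs. derived from the ground-truth mask.
    Classes with larger relative error receive larger penalty weights.
    """
    total = 0.0
    for pred_row, true_row in zip(pred_stats, true_stats):
        # absolute per-class discrepancies
        diffs = [abs(p - t) for p, t in zip(pred_row, true_row)]
        # relative error normalizes away each class's scale
        rel = [d / (abs(t) + eps) for d, t in zip(diffs, true_row)]
        # softmax over relative errors -> dynamic per-class weights
        exps = [math.exp(r) for r in rel]
        z = sum(exps)
        weights = [e / z for e in exps]
        total += sum(w * d for w, d in zip(weights, diffs))
    return total / len(pred_stats)

# Perfect predictions incur zero loss; discrepancies are penalized,
# with the largest relative errors weighted most heavily.
```

In this reading, the softmax weighting is what makes the penalty "dynamic": rather than fixing class weights in advance, each batch re-allocates penalty mass toward the classes whose statistics are currently worst matched.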
Problem

Research questions and friction points this paper is trying to address.

Earth vision
relational reasoning
scene understanding
remote sensing
visual question answering
Innovation

Methods, ideas, or system contributions that make the work stand out.

Earth Vision-Language
Semantic-Guided Reasoning
Relational Reasoning
Multi-task Remote Sensing Dataset
Object-Centric VQA