DescribeEarth: Describe Anything for Remote Sensing Images

📅 2025-09-29
🤖 AI Summary
Existing remote sensing image captioning methods predominantly operate at the coarse-grained image level, failing to capture object-level semantics and structural details. To address this, we propose Geo-DLC—the first object-level fine-grained captioning task for remote sensing—and introduce DE-Dataset, a large-scale benchmark with precise object-level attribute and contextual annotations, alongside the domain-specific evaluation framework DE-Benchmark. We further design DescribeEarth, a multimodal large language model tailored for remote sensing, incorporating a scale-adaptive focal mechanism and a domain-guided fusion module to jointly model high-resolution visual details and geospatial semantic priors. Experiments demonstrate that DescribeEarth consistently outperforms general-purpose multimodal LLMs on DE-Benchmark, achieving significant gains in factual accuracy, descriptive richness, and grammatical correctness. Notably, it exhibits robust performance across simple scenes, complex scenes, and out-of-distribution remote sensing imagery.

📝 Abstract
Automated textual description of remote sensing images is crucial for unlocking their full potential in diverse applications, from environmental monitoring to urban planning and disaster management. However, existing studies in remote sensing image captioning primarily focus on the image level and lack object-level fine-grained interpretation, which prevents full use of the rich semantic and structural information contained in remote sensing images. To address this limitation, we propose Geo-DLC, a novel task of object-level fine-grained image captioning for remote sensing. To support this task, we construct DE-Dataset, a large-scale dataset containing 25 categories and 261,806 annotated instances with detailed descriptions of object attributes, relationships, and contexts. Furthermore, we introduce DE-Benchmark, an LLM-assisted, question-answering-based evaluation suite designed to systematically measure model capabilities on the Geo-DLC task. We also present DescribeEarth, a Multi-modal Large Language Model (MLLM) architecture explicitly designed for Geo-DLC, which integrates a scale-adaptive focal strategy and a domain-guided fusion module leveraging remote sensing vision-language model features to encode high-resolution details and remote sensing category priors while maintaining global context. Our DescribeEarth model consistently outperforms state-of-the-art general MLLMs on DE-Benchmark, demonstrating superior factual accuracy, descriptive richness, and grammatical soundness, particularly in capturing intrinsic object features and surrounding environmental attributes across simple, complex, and even out-of-distribution remote sensing scenarios. All data, code, and weights are released at https://github.com/earth-insights/DescribeEarth.
Problem

Research questions and friction points this paper is trying to address.

Addresses lack of object-level fine-grained interpretation in remote sensing images
Proposes Geo-DLC task for detailed object attribute and relationship description
Develops MLLM architecture to capture high-resolution details and environmental context
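
To make the idea of keeping both high-resolution detail and environmental context concrete, here is a minimal sketch of a size-dependent crop around an object box. It is an illustration only, not the paper's scale-adaptive focal strategy: the function name `focal_crop` and the parameters `min_context`, `max_context`, and `ref_size` are hypothetical and chosen purely for this example.

```python
def focal_crop(box, image_size, min_context=1.5, max_context=4.0, ref_size=256):
    """Return a square crop window (x0, y0, x1, y1) centred on an object box.

    Hypothetical illustration: small objects receive a larger relative
    context margin, large objects a smaller one. All parameters are
    assumptions for this sketch, not values from the paper.
    """
    x0, y0, x1, y1 = box
    width, height = image_size
    side = max(x1 - x0, y1 - y0)
    if side <= 0:
        raise ValueError("box must have positive width and height")

    # Context factor shrinks as the object grows, clamped to a fixed range.
    factor = max(min_context, min(max_context, ref_size / side))
    half = side * factor / 2
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2

    # Clip the window to the image bounds.
    return (
        max(0, cx - half),
        max(0, cy - half),
        min(width, cx + half),
        min(height, cy + half),
    )


small = focal_crop((100, 100, 120, 120), (1000, 1000))
large = focal_crop((0, 0, 512, 512), (1000, 1000))
```

In this sketch a 20-pixel object is cropped with four times its own extent, while a 512-pixel object gets only 1.5 times, so the crop passed to a model would always contain some surroundings without losing the object itself.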
Innovation

Methods, ideas, or system contributions that make the work stand out.

Object-level fine-grained captioning for remote sensing
Scale-adaptive focal strategy with domain-guided fusion
LLM-assisted question-answering evaluation benchmark
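
A question-answering-based evaluation can be pictured as follows: each object has a set of question and reference-answer pairs, a judge decides whether a generated caption supports each answer, and the score is the fraction judged correct. The sketch below is a simplified assumption, not the DE-Benchmark protocol. The names `qa_accuracy` and `keyword_judge` are hypothetical, and the keyword judge is a toy stand-in for the LLM judge that the benchmark relies on.

```python
from typing import Callable, Sequence, Tuple

Judge = Callable[[str, str, str], bool]


def keyword_judge(caption: str, question: str, answer: str) -> bool:
    """Toy judge (assumption): the caption is correct if it contains the answer.

    A real LLM-assisted judge would reason about the question as well.
    """
    return answer.lower() in caption.lower()


def qa_accuracy(
    caption: str,
    qa_pairs: Sequence[Tuple[str, str]],
    judge: Judge = keyword_judge,
) -> float:
    """Fraction of question/answer pairs that the judge marks as supported."""
    if not qa_pairs:
        raise ValueError("qa_pairs must not be empty")
    verdicts = [judge(caption, question, answer) for question, answer in qa_pairs]
    return sum(verdicts) / len(verdicts)


caption = "A white storage tank next to a road."
qa_pairs = [
    ("What color is the tank?", "white"),
    ("What is located nearby?", "road"),
    ("What is the roof shape?", "dome"),
]
score = qa_accuracy(caption, qa_pairs)
```

Passing the judge in as a callable keeps the scoring loop independent of whichever model or rule actually produces the verdicts.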
Kaiyu Li
Wilfrid Laurier University, Canada
Data governance and Data preparation; Data market and Data economy
Zixuan Jiang
College of Artificial Intelligence, Xi’an Jiaotong University, Xi’an 710049, China
Xiangyong Cao
School of Computer Science and Technology and Ministry of Education Key Lab For Intelligent Networks and Network Security, Xi’an Jiaotong University, Xi’an 710049, China
Jiayu Wang
Beihang University & Jiangnan University & The University of Auckland
Soft sensor; data driven; fault detection; process monitoring
Yuchen Xiao
Lead of Embodied AI R&D, Unitree | Research Scientist, J.P. Morgan | Ph.D. Northeastern University
Generative Models; Robot Learning; Reinforcement Learning; Multi-Agent Systems
Deyu Meng
Professor, Xi'an Jiaotong University
Machine Learning; Applied Mathematics; Computer Vision; Artificial Intelligence
Zhi Wang
School of Software Engineering, Xi’an Jiaotong University, Xi’an 710049, China