Landsat30-AU: A Vision-Language Dataset for Australian Landsat Imagery

📅 2025-08-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Existing vision-language model (VLM) datasets focus predominantly on short-term, high-resolution, single-source satellite imagery, limiting their utility for robust, long-term global land-surface monitoring. Method: We introduce Landsat30-AU, the first Australia-focused, multi-temporal vision-language dataset, spanning 36 years of observations from four Landsat satellites at 30 m spatial resolution. It comprises two task-specific subsets: image captioning (Landsat30-AU-Cap) and visual question answering (Landsat30-AU-VQA). We propose a bootstrapping pipeline that combines general-purpose VLMs, iterative refinement, and human verification to construct high-quality annotations for long-archive, low-resolution, multi-sensor remote sensing data. Contribution/Results: Benchmark experiments show that lightweight fine-tuning of Qwen2.5-VL-7B yields substantial improvements: the SPIDEr score rises from 0.11 to 0.31 and VQA accuracy from 0.74 to 0.87, validating the dataset's effectiveness and learnability for long-term Earth observation tasks.
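As a rough illustration of how such a bootstrapped annotation pipeline could look, here is a minimal Python sketch: a generic VLM drafts a caption, re-prompts itself to refine it, and short or suspect outputs are flagged for human review. The model choice, prompts, and triage rule are assumptions for illustration, not the authors' released code.

```python
# Illustrative sketch of a bootstrapped captioning pipeline: a generic VLM
# drafts a caption, iteratively refines it, and flags weak outputs for
# human verification. Prompts and the triage rule are hypothetical.
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from PIL import Image

MODEL_ID = "Qwen/Qwen2.5-VL-7B-Instruct"
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(MODEL_ID, device_map="auto")

def ask(image, prompt):
    # One VLM round: render the chat template, then generate from image + prompt.
    messages = [{"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": prompt},
    ]}]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=128)
    return processor.batch_decode(
        out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0].strip()

def bootstrap_caption(image, rounds=2):
    caption = ask(image, "Describe the land cover in this 30 m Landsat tile.")
    for _ in range(rounds):  # iterative refinement: critique-and-rewrite
        caption = ask(image, f"Rewrite this caption, dropping anything not visible: {caption}")
    needs_review = len(caption.split()) < 8  # toy stand-in for human-verification triage
    return caption, needs_review

caption, needs_review = bootstrap_caption(Image.open("tile.png"))
```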

📝 Abstract
Vision language models (VLMs) that enable natural language interaction with satellite imagery can democratize Earth observation by accelerating expert workflows, making data accessible to non-specialists, and enabling planet-scale automation. However, existing datasets focus mainly on short-term, high-resolution imagery from a limited number of satellites, overlooking low-resolution, multi-satellite, long-term archives, such as Landsat, that are essential for affordable and bias-robust global monitoring. We address this gap with Landsat30-AU, a large-scale vision-language dataset built from 30-meter resolution imagery collected by four Landsat satellites (5, 7, 8, and 9) over Australia, spanning more than 36 years. The dataset includes two components: Landsat30-AU-Cap, containing 196,262 image-caption pairs, and Landsat30-AU-VQA, comprising 17,725 human-verified visual question answering (VQA) samples across eight remote sensing domains. Both datasets are curated through a bootstrapped pipeline that leverages generic VLMs with iterative refinement and human verification to ensure quality. Our evaluation of eight VLMs on our benchmark reveals that off-the-shelf models struggle to understand satellite imagery. The open-source remote-sensing VLM EarthDial achieves only 0.07 SPIDEr in captioning and a VQA accuracy of 0.48, highlighting the limitations of current approaches. Encouragingly, lightweight fine-tuning of Qwen2.5-VL-7B on Landsat30-AU improves captioning performance from 0.11 to 0.31 SPIDEr and boosts VQA accuracy from 0.74 to 0.87. Code and data are available at https://github.com/papersubmit1/landsat30-au.
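For context, SPIDEr, the captioning metric quoted throughout, is the arithmetic mean of SPICE and CIDEr. Below is a minimal sketch using the pycocoevalcap package; this implementation choice is an assumption (the paper does not say which scorer it uses), and SPICE additionally requires a Java runtime.

```python
# SPIDEr = (SPICE + CIDEr) / 2, computed here with pycocoevalcap
# (pip install pycocoevalcap; the SPICE scorer shells out to Java).
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.spice.spice import Spice

def spider(references, candidates):
    """references/candidates: dicts mapping image id -> list of caption strings."""
    cider_score, _ = Cider().compute_score(references, candidates)
    spice_score, _ = Spice().compute_score(references, candidates)
    return (cider_score + spice_score) / 2.0

refs = {"img1": ["cleared farmland beside a winding river"]}
cands = {"img1": ["farmland next to a river"]}
print(f"SPIDEr = {spider(refs, cands):.3f}")
```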
Problem

Research questions and friction points this paper is trying to address.

Lack of vision-language datasets for long-term, low-resolution satellite imagery
Off-the-shelf VLMs struggle to interpret satellite imagery accurately
Need for affordable, bias-robust global monitoring solutions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Landsat30-AU dataset with 36-year multi-satellite imagery
Bootstrapped pipeline for quality image-caption and VQA data
Lightweight fine-tuning of Qwen2.5-VL-7B improves captioning and VQA performance (see the sketch after this list)
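The reported gains come from lightweight fine-tuning. Here is a minimal sketch of one plausible setup, LoRA adapters on Qwen2.5-VL-7B via peft; the rank, target modules, and other hyperparameters are assumptions, since the exact recipe is not given here.

```python
# One plausible reading of "lightweight fine-tuning": LoRA adapters on the
# attention projections of Qwen2.5-VL-7B. Hyperparameters are assumptions.
import torch
from transformers import Qwen2_5_VLForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto")

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention only
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of the 7B weights

# Training then proceeds with a standard causal-LM loss over caption/VQA
# targets, e.g. via transformers.Trainer on Landsat30-AU image-text pairs.
```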
Sai Ma
Federal Reserve Board of Governors
Macro Finance · Asset Pricing
Zhuang Li
School of Computing Technologies, Royal Melbourne Institute of Technology, Australia
John A Taylor
School of Computing, College of Systems and Society, Australian National University, Australia