MEET: A Million-Scale Dataset for Fine-Grained Geospatial Scene Classification with Zoom-Free Remote Sensing Imagery

📅 2025-03-14
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To remove the reliance on manually zooming remote sensing imagery when building fine-grained geospatial scene samples, this paper proposes a zoom-free paradigm. It introduces MEET, a million-scale fine-grained dataset (over 1.03M samples, 80 manually annotated classes) in which each sample follows a scene-in-scene layout: a central scene is the classification target and auxiliary scenes supply spatial context. To handle the resulting scene-in-scene classification task, the authors propose the Context-Aware Transformer (CAT), which adaptively fuses spatial context by learning attentional features that relate the center and auxiliary scenes on top of a Swin backbone. On the MEET benchmark, CAT outperforms 11 baselines, achieving 1.88% higher balanced accuracy with a Swin-Large backbone and 7.87% higher with a Swin-Huge backbone, and further experiments demonstrate its practical value for urban functional zone mapping.
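The paper is summarized here without code, but the description above suggests a cross-attention style fusion between center-scene and auxiliary-scene tokens. The following is a minimal sketch of what such an adaptive center-context fusion on top of Swin-style token features could look like; the module name, dimensions, and gating design are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: adaptive center-context attention fusion in the
# spirit of CAT. Names and shapes are assumptions, not the released code.
import torch
import torch.nn as nn


class CenterContextFusion(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        # Cross-attention: center-scene tokens attend to auxiliary-scene tokens.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Learned gate deciding how much spatial context to mix into each token.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, center_feat: torch.Tensor, context_feat: torch.Tensor) -> torch.Tensor:
        # center_feat:  (B, Nc, D) tokens of the central scene (e.g. from a Swin backbone)
        # context_feat: (B, Na, D) tokens of the surrounding auxiliary scenes
        attended, _ = self.cross_attn(query=center_feat, key=context_feat, value=context_feat)
        g = self.gate(torch.cat([center_feat, attended], dim=-1))
        fused = self.norm(center_feat + g * attended)   # adaptive residual fusion
        return fused.mean(dim=1)                        # pooled embedding for the classifier


# Usage: fuse token maps from the center crop and eight surrounding crops.
fusion = CenterContextFusion(dim=768)
center = torch.randn(2, 49, 768)        # 7x7 Swin tokens for the central scene
context = torch.randn(2, 8 * 49, 768)   # tokens from 8 auxiliary scenes
embedding = fusion(center, context)     # shape (2, 768)
```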

📝 Abstract
Accurate fine-grained geospatial scene classification using remote sensing imagery is essential for a wide range of applications. However, existing approaches often rely on manually zooming remote sensing images at different scales to create typical scene samples. This approach fails to adequately support the fixed-resolution image interpretation requirements in real-world scenarios. To address this limitation, we introduce the Million-scale finE-grained geospatial scEne classification dataseT (MEET), which contains over 1.03 million zoom-free remote sensing scene samples, manually annotated into 80 fine-grained categories. In MEET, each scene sample follows a scene-in-scene layout, where the central scene serves as the reference, and auxiliary scenes provide crucial spatial context for fine-grained classification. Moreover, to tackle the emerging challenge of scene-in-scene classification, we present the Context-Aware Transformer (CAT), a model specifically designed for this task, which adaptively fuses spatial context to accurately classify the scene samples by learning attentional features that capture the relationships between the center and auxiliary scenes. Based on MEET, we establish a comprehensive benchmark for fine-grained geospatial scene classification, evaluating CAT against 11 competitive baselines. The results demonstrate that CAT significantly outperforms these baselines, achieving a 1.88% higher balanced accuracy (BA) with the Swin-Large backbone, and a notable 7.87% improvement with the Swin-Huge backbone. Further experiments validate the effectiveness of each module in CAT and show the practical applicability of CAT in urban functional zone mapping. The source code and dataset will be publicly available at https://jerrywyn.github.io/project/MEET.html.
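The reported gains are measured in balanced accuracy (BA), i.e. the mean of per-class recalls, which stays informative under the uneven class sizes of an 80-class fine-grained dataset. Below is a short, standard computation of BA as a sketch; it is not the authors' evaluation code.

```python
# Balanced accuracy (BA) = mean of per-class recalls.
import numpy as np
from sklearn.metrics import balanced_accuracy_score

y_true = np.array([0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2])

# Manual computation: recall for each class, then the unweighted mean.
per_class_recall = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
print(np.mean(per_class_recall))                # 0.888...
print(balanced_accuracy_score(y_true, y_pred))  # same value from scikit-learn
```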
Problem

Research questions and friction points this paper is trying to address.

Existing approaches rely on manually zooming remote sensing images at different scales to create typical scene samples.
This zoom-dependent pipeline fails to support fixed-resolution image interpretation in real-world scenarios.
Scene-in-scene classification, where auxiliary scenes surround the scene to be classified, is an emerging challenge lacking a dedicated large-scale dataset and model.
Innovation

Methods, ideas, or system contributions that make the work stand out.

MEET dataset: million-scale, zoom-free remote sensing imagery with 80 fine-grained classes
Context-Aware Transformer (CAT): adaptive spatial context fusion between center and auxiliary scenes
Scene-in-scene layout: a central scene paired with surrounding auxiliary scenes (see the sketch after this list)
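The scene-in-scene layout pairs a central scene with the auxiliary scenes that surround it in the same zoom-free image. A minimal sketch of how such a sample could be cut from a fixed-resolution tile is shown below; the 3x3 grid, patch size, and helper name are illustrative assumptions, not the MEET specification.

```python
# Illustrative sketch: cut a scene-in-scene sample from one fixed-resolution
# (zoom-free) tile as a 3x3 grid. The center crop is the scene to classify,
# the 8 surrounding crops provide spatial context.
import numpy as np


def scene_in_scene(tile: np.ndarray, patch: int = 256):
    """Split a (3*patch, 3*patch, C) tile into a center scene and auxiliary scenes."""
    crops = [
        tile[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch]
        for r in range(3) for c in range(3)
    ]
    center = crops[4]                                    # middle cell of the 3x3 grid
    auxiliary = [p for i, p in enumerate(crops) if i != 4]
    return center, auxiliary


tile = np.zeros((768, 768, 3), dtype=np.uint8)           # one fixed-resolution tile
center, auxiliary = scene_in_scene(tile)
print(center.shape, len(auxiliary))                      # (256, 256, 3) 8
```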