LC4-DViT: Land-cover Creation for Land-cover Classification with Deformable Vision Transformer

πŸ“… 2025-11-27
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
To address label scarcity, severe class imbalance, and geometric distortion interference in high-resolution remote sensing image land-cover classification, this paper proposes a text-guided diffusion-based generative data augmentation and deformation-aware classification framework. Methodologically, we leverage GPT-4o to generate scene descriptions that drive a diffusion model to synthesize high-fidelity, semantically consistent remote sensing imagery; we further design the Deformation-aware Vision Transformer (DViT), integrating DCNv4’s geometric deformation modeling with ViT’s global contextual representation for joint geometric-semantic feature learning. On the AID dataset, our method achieves 0.9572 overall accuracy and 0.9576 macro-F1, significantly outperforming ViT, ResNet50, and other baselines; it also demonstrates strong cross-dataset generalization on SIRI-WHU. Key contributions include: (i) the first description-driven diffusion generation paradigm for remote sensing imagery; and (ii) a novel deformation-aware ViT architecture that substantially enhances robustness and generalizability for fine-grained land-cover recognition under few-shot settings.

Technology Category

Application Category

πŸ“ Abstract
Land-cover underpins ecosystem services, hydrologic regulation, disaster-risk reduction, and evidence-based land planning; timely, accurate land-cover maps are therefore critical for environmental stewardship. Remote sensing-based land-cover classification offers a scalable route to such maps but is hindered by scarce and imbalanced annotations and by geometric distortions in high-resolution scenes. We propose LC4-DViT (Land-cover Creation for Land-cover Classification with Deformable Vision Transformer), a framework that combines generative data creation with a deformation-aware Vision Transformer. A text-guided diffusion pipeline uses GPT-4o-generated scene descriptions and super-resolved exemplars to synthesize class-balanced, high-fidelity training images, while DViT couples a DCNv4 deformable convolutional backbone with a Vision Transformer encoder to jointly capture fine-scale geometry and global context. On eight classes from the Aerial Image Dataset (AID)-Beach, Bridge, Desert, Forest, Mountain, Pond, Port, and River-DViT achieves 0.9572 overall accuracy, 0.9576 macro F1-score, and 0.9510 Cohen' s Kappa, improving over a vanilla ViT baseline (0.9274 OA, 0.9300 macro F1, 0.9169 Kappa) and outperforming ResNet50, MobileNetV2, and FlashInternImage. Cross-dataset experiments on a three-class SIRI-WHU subset (Harbor, Pond, River) yield 0.9333 overall accuracy, 0.9316 macro F1, and 0.8989 Kappa, indicating good transferability. An LLM-based judge using GPT-4o to score Grad-CAM heatmaps further shows that DViT' s attention aligns best with hydrologically meaningful structures. These results suggest that description-driven generative augmentation combined with deformation-aware transformers is a promising approach for high-resolution land-cover mapping.
Problem

Research questions and friction points this paper is trying to address.

Addresses scarce and imbalanced annotations in land-cover classification
Mitigates geometric distortions in high-resolution remote sensing scenes
Improves classification accuracy and transferability for environmental mapping
Innovation

Methods, ideas, or system contributions that make the work stand out.

Text-guided diffusion pipeline generates balanced training images
Deformable Vision Transformer captures fine geometry and global context
GPT-4o judges attention alignment with meaningful structures
πŸ”Ž Similar Papers
No similar papers found.
K
Kai Wang
The Hong Kong University of Science and Technology; The Chinese University of Hong Kong, Shenzhen
S
Siyi Chen
The Hong Kong University of Science and Technology; The Johns Hopkins University
W
Weicong Pang
The Hong Kong University of Science and Technology; National University of Singapore
C
Chenchen Zhang
The Hong Kong University of Science and Technology
R
Renjun Gao
Macau University of Science and Technology
Ziru Chen
Ziru Chen
The Ohio State University
Conversational AINatural Language ProcessingMachine Learning
C
Cheng Li
The Hong Kong University of Science and Technology
Dasa Gu
Dasa Gu
Hong Kong University of Science and Technology
Atmospheric ChemistryVolatile Organic CompoundsNumerical ModelingSatellite Remote SensingEmission
R
Rui Huang
The Hong Kong University of Science and Technology; The Chinese University of Hong Kong, Shenzhen
A
Alexis Kai Hon Lau
The Hong Kong University of Science and Technology