LC4-DViT: Land-cover Creation for Land-cover Classification with Deformable Vision Transformer

📅 2025-11-27

📈 Citations: 0

✨ Influential: 0

career value

172K/year

🤖 AI Summary

To address label scarcity, severe class imbalance, and geometric distortion interference in high-resolution remote sensing image land-cover classification, this paper proposes a text-guided diffusion-based generative data augmentation and deformation-aware classification framework. Methodologically, we leverage GPT-4o to generate scene descriptions that drive a diffusion model to synthesize high-fidelity, semantically consistent remote sensing imagery; we further design the Deformation-aware Vision Transformer (DViT), integrating DCNv4’s geometric deformation modeling with ViT’s global contextual representation for joint geometric-semantic feature learning. On the AID dataset, our method achieves 0.9572 overall accuracy and 0.9576 macro-F1, significantly outperforming ViT, ResNet50, and other baselines; it also demonstrates strong cross-dataset generalization on SIRI-WHU. Key contributions include: (i) the first description-driven diffusion generation paradigm for remote sensing imagery; and (ii) a novel deformation-aware ViT architecture that substantially enhances robustness and generalizability for fine-grained land-cover recognition under few-shot settings.

Technology Category

Application Category

📝 Abstract

Land-cover underpins ecosystem services, hydrologic regulation, disaster-risk reduction, and evidence-based land planning; timely, accurate land-cover maps are therefore critical for environmental stewardship. Remote sensing-based land-cover classification offers a scalable route to such maps but is hindered by scarce and imbalanced annotations and by geometric distortions in high-resolution scenes. We propose LC4-DViT (Land-cover Creation for Land-cover Classification with Deformable Vision Transformer), a framework that combines generative data creation with a deformation-aware Vision Transformer. A text-guided diffusion pipeline uses GPT-4o-generated scene descriptions and super-resolved exemplars to synthesize class-balanced, high-fidelity training images, while DViT couples a DCNv4 deformable convolutional backbone with a Vision Transformer encoder to jointly capture fine-scale geometry and global context. On eight classes from the Aerial Image Dataset (AID)-Beach, Bridge, Desert, Forest, Mountain, Pond, Port, and River-DViT achieves 0.9572 overall accuracy, 0.9576 macro F1-score, and 0.9510 Cohen' s Kappa, improving over a vanilla ViT baseline (0.9274 OA, 0.9300 macro F1, 0.9169 Kappa) and outperforming ResNet50, MobileNetV2, and FlashInternImage. Cross-dataset experiments on a three-class SIRI-WHU subset (Harbor, Pond, River) yield 0.9333 overall accuracy, 0.9316 macro F1, and 0.8989 Kappa, indicating good transferability. An LLM-based judge using GPT-4o to score Grad-CAM heatmaps further shows that DViT' s attention aligns best with hydrologically meaningful structures. These results suggest that description-driven generative augmentation combined with deformation-aware transformers is a promising approach for high-resolution land-cover mapping.

Problem

Research questions and friction points this paper is trying to address.

Addresses scarce and imbalanced annotations in land-cover classification

Mitigates geometric distortions in high-resolution remote sensing scenes

Improves classification accuracy and transferability for environmental mapping

Innovation

Methods, ideas, or system contributions that make the work stand out.

Text-guided diffusion pipeline generates balanced training images

Deformable Vision Transformer captures fine geometry and global context

GPT-4o judges attention alignment with meaningful structures

🔎 Similar Papers

No similar papers found.