DynamicVis: An Efficient and General Visual Foundation Model for Remote Sensing Image Understanding

📅 2025-03-20

🤖 AI Summary

To address key bottlenecks in high-resolution remote sensing imagery (sparse object distribution, difficulty in semantic modeling, and poor cross-task generalization), this paper proposes DynamicVis, a foundation model with a dynamic region-aware backbone based on selective state space models (SSMs). It introduces, for the first time, region-level meta-embeddings and a multi-instance learning paradigm to enable fine-grained small-object modeling and efficient knowledge transfer. The method combines dynamic region perception with training on large-scale region-level remote sensing annotations, substantially improving its ability to model long 2D token sequences. Evaluated across nine downstream tasks, it achieves state-of-the-art performance. Inference on a 2048×2048 image takes only 97 ms (6% of ViT's latency) and 833 MB of GPU memory (3% of ViT's), enabling scalable, fine-grained semantic understanding of large-scale scenes.

📝 Abstract

The advancement of remote sensing technology has improved the spatial resolution of satellite imagery, facilitating more detailed visual representations for diverse interpretations. However, existing methods exhibit limited generalization capabilities across varied applications. While some contemporary foundation models demonstrate potential, they are hindered by insufficient cross-task adaptability and primarily process low-resolution imagery of restricted sizes, thus failing to fully exploit high-resolution data or leverage comprehensive large-scene semantics. Crucially, remote sensing imagery differs fundamentally from natural images, as key foreground targets (e.g., maritime objects, artificial structures) often occupy minimal spatial proportions (~1%) and exhibit sparse distributions. Efficiently modeling cross-task generalizable knowledge from long sequences of 2D tokens (~100,000) poses a significant challenge yet remains critical for remote sensing image understanding. Motivated by the selective attention mechanisms inherent to the human visual system, we propose DynamicVis, a dynamic visual perception foundation model for remote sensing imagery. The framework integrates a novel dynamic region perception backbone based on the selective state space model, which strategically balances localized detail extraction with global contextual integration, enabling computationally efficient encoding of large-scale data while maintaining architectural scalability. To enhance cross-task knowledge transfer, we introduce a multi-instance learning paradigm utilizing meta-embedding representations, trained on million-scale region-level annotations. Evaluations across nine downstream tasks demonstrate the model's versatility. DynamicVis achieves multi-level feature modeling with exceptional efficiency, processing 2048×2048-pixel images with 97 ms latency (6% of ViT's) and 833 MB GPU memory (3% of ViT's).
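The scale figures in the abstract can be sanity-checked with a little arithmetic. The sketch below is not from the paper: the patch sizes are hypothetical (the abstract does not state one), and the ViT baseline is only inferred from the rounded percentages, so the results are approximate. It counts the tokens a 2048×2048 image yields, shows how many pairwise interactions a full self-attention layer would need over them, and backs out the ViT latency and memory implied by "6%" and "3%".

```python
def num_tokens(image_side: int, patch_side: int) -> int:
    """Number of non-overlapping square patches (tokens) in a square image."""
    if image_side % patch_side != 0:
        raise ValueError("image_side must be divisible by patch_side")
    per_side = image_side // patch_side
    return per_side * per_side


def implied_baseline(value: float, fraction: float) -> float:
    """Baseline figure implied by 'value is `fraction` of the baseline'."""
    return value / fraction


# Hypothetical patch sizes; the abstract does not say which one is used.
for patch in (16, 8):
    n = num_tokens(2048, patch)
    # Full self-attention compares every token with every other token,
    # so its cost grows with n * n; a linear-time scan grows with n.
    print(f"patch {patch}: {n} tokens, {n * n} pairwise interactions")

# Inferred from rounded percentages, so treat these as rough estimates.
vit_latency_ms = implied_baseline(97, 0.06)
vit_memory_mb = implied_baseline(833, 0.03)
print(f"implied ViT latency: ~{vit_latency_ms:.0f} ms")
print(f"implied ViT memory:  ~{vit_memory_mb:.0f} MB")
```

Under these assumptions the implied ViT baseline is on the order of 1.6 s and 28 GB per image, which illustrates why quadratic attention over tens of thousands of tokens becomes impractical at this resolution.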

Problem

Research questions and friction points this paper is trying to address.

Limited generalization in remote sensing image analysis.

Challenges in processing high-resolution, large-scale imagery.

Sparse distribution of key targets in remote sensing data.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic region perception backbone for detail extraction.

Multi-instance learning with meta-embedding representations.

Efficient large-scale data encoding with low latency.
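To make the idea of region selection concrete, here is a minimal sketch of keeping only the most salient tokens before further processing. It is a generic top-k selection, not the paper's backbone: the function name `select_salient_tokens`, the `keep_ratio` parameter, and the toy scores are all assumptions introduced for illustration, and the summary does not describe how DynamicVis actually scores or selects regions.

```python
import heapq
from typing import List, Sequence


def select_salient_tokens(scores: Sequence[float], keep_ratio: float) -> List[int]:
    """Return indices of the highest-scoring tokens, in original order.

    Hypothetical helper: a generic top-k filter, not the DynamicVis method.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    # Keep at least one token so downstream code always has input.
    k = max(1, int(len(scores) * keep_ratio))
    top = heapq.nlargest(k, range(len(scores)), key=lambda i: scores[i])
    # Sort so the kept tokens preserve their spatial (sequence) order.
    return sorted(top)


# Toy saliency scores for 10 tokens; the values are made up for illustration.
scores = [0.1, 0.9, 0.2, 0.05, 0.8, 0.3, 0.0, 0.15, 0.7, 0.25]
kept = select_salient_tokens(scores, keep_ratio=0.3)
print(kept)
```

If foreground targets cover only about 1% of an image, as the abstract notes, a filter of this kind would let the expensive part of a model run on a small fraction of the tokens while a cheaper pass handles global context.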
