🤖 AI Summary
To address the dual challenges of scarce training data and token explosion in ultra-high-resolution (UHR) remote sensing image understanding, this work introduces GeoLLaVA-8K, the first remote sensing multimodal large language model capable of processing 8K×8K inputs. Methodologically, it proposes two token-sparsification strategies, Background Token Pruning and Anchored Token Selection, motivated by pilot studies showing that crucial information concentrates in a small subset of object-centric tokens while background tokens are largely redundant; both are implemented atop the LLaVA framework. The authors release SuperRS-VQA (avg. resolution 8,376×8,376), the highest-resolution remote sensing vision-language dataset to date, alongside HighRS-VQA (avg. 2,000×1,912), together covering 22 real-world dialogue tasks. Experiments demonstrate state-of-the-art performance on XLRS-Bench with a substantially reduced memory footprint.
📝 Abstract
Ultra-high-resolution (UHR) remote sensing (RS) imagery offers valuable data for Earth observation but poses challenges for existing multimodal foundation models due to two key bottlenecks: (1) limited availability of UHR training data, and (2) token explosion caused by the large image size. To address data scarcity, we introduce SuperRS-VQA (avg. 8,376$\times$8,376) and HighRS-VQA (avg. 2,000$\times$1,912), the highest-resolution vision-language datasets in RS to date, covering 22 real-world dialogue tasks. To mitigate token explosion, our pilot studies reveal significant redundancy in RS images: crucial information is concentrated in a small subset of object-centric tokens, while pruning background tokens (e.g., ocean or forest) can even improve performance. Motivated by these findings, we propose two strategies, Background Token Pruning and Anchored Token Selection, to reduce the memory footprint while preserving key semantics. Integrating these techniques, we introduce GeoLLaVA-8K, the first RS-focused multimodal large language model capable of handling inputs up to 8K$\times$8K resolution, built on the LLaVA framework. Trained on SuperRS-VQA and HighRS-VQA, GeoLLaVA-8K sets a new state-of-the-art on the XLRS-Bench.
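To make the token-explosion bottleneck and the pruning idea concrete, here is a minimal sketch. The 14×14 patch size (typical of CLIP/LLaVA-class vision encoders), the saliency scores, and the keep ratio are illustrative assumptions, not the paper's actual method:

```python
def vit_token_count(height: int, width: int, patch: int = 14) -> int:
    """Visual tokens a plain ViT-style patch embedding would produce.

    Assumes non-overlapping patch x patch tiles (CLIP/LLaVA convention);
    the 14-pixel patch size is an assumption, not from the paper.
    """
    return (height // patch) * (width // patch)


def prune_tokens(tokens: list, scores: list, keep_ratio: float = 0.25) -> list:
    """Keep only the top-scoring fraction of tokens, preserving spatial order.

    `scores` stands in for some hypothetical per-token saliency measure
    (e.g., how object-centric a patch is); the real selection criteria in
    Background Token Pruning / Anchored Token Selection are more involved.
    """
    k = max(1, int(len(tokens) * keep_ratio))
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    return [tokens[i] for i in sorted(top)]  # restore original token order


# Token counts explode quadratically with resolution:
for side in (336, 2000, 8192):
    print(f"{side}x{side} -> {vit_token_count(side, side):,} tokens")
```

At a standard 336×336 input this yields 576 tokens, but an 8K-class image produces over 340,000, which is far beyond typical LLM context budgets and motivates aggressive pruning before tokens reach the language model.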