🤖 AI Summary
To address the dual challenges of scarce training data and token explosion in ultra-high-resolution (UHR) remote sensing image understanding, this work introduces GeoLLaVA-8K, the first remote sensing multimodal large language model capable of processing 8K×8K inputs. Methodologically, it proposes two token-sparsification strategies, Background Token Pruning and Anchored Token Selection, motivated by pilot studies showing that crucial information concentrates in a small subset of object-centric tokens while background tokens are largely redundant; both are implemented atop the LLaVA framework. The authors release SuperRS-VQA (avg. resolution 8,376×8,376), the highest-resolution remote sensing vision-language dataset to date, alongside HighRS-VQA (avg. 2,000×1,912), together covering 22 real-world dialogue tasks. Experiments demonstrate state-of-the-art performance on XLRS-Bench with a substantially reduced memory footprint.
📝 Abstract
Ultra-high-resolution (UHR) remote sensing (RS) imagery offers valuable data for Earth observation but poses challenges for existing multimodal foundation models due to two key bottlenecks: (1) limited availability of UHR training data, and (2) token explosion caused by the large image size. To address data scarcity, we introduce SuperRS-VQA (avg. 8,376$\times$8,376) and HighRS-VQA (avg. 2,000$\times$1,912), the highest-resolution vision-language datasets in RS to date, covering 22 real-world dialogue tasks. To mitigate token explosion, our pilot studies reveal significant redundancy in RS images: crucial information is concentrated in a small subset of object-centric tokens, while pruning background tokens (e.g., ocean or forest) can even improve performance. Motivated by these findings, we propose two strategies, Background Token Pruning and Anchored Token Selection, to reduce the memory footprint while preserving key semantics. Integrating these techniques, we introduce GeoLLaVA-8K, the first RS-focused multimodal large language model capable of handling inputs up to 8K$\times$8K resolution, built on the LLaVA framework. Trained on SuperRS-VQA and HighRS-VQA, GeoLLaVA-8K sets a new state-of-the-art on the XLRS-Bench.
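To make the token-explosion bottleneck and the pruning idea concrete, here is a minimal sketch. The 14×14 patch size (typical of CLIP/LLaVA-class vision encoders), the saliency scores, and the keep ratio are illustrative assumptions, not the paper's actual method:

```python
def vit_token_count(height: int, width: int, patch: int = 14) -> int:
    """Visual tokens a plain ViT-style patch embedding would produce.

    Assumes non-overlapping patch x patch tiles (CLIP/LLaVA convention);
    the 14-pixel patch size is an assumption, not from the paper.
    """
    return (height // patch) * (width // patch)


def prune_tokens(tokens: list, scores: list, keep_ratio: float = 0.25) -> list:
    """Keep only the top-scoring fraction of tokens, preserving spatial order.

    `scores` stands in for some hypothetical per-token saliency measure
    (e.g., how object-centric a patch is); the real selection criteria in
    Background Token Pruning / Anchored Token Selection are more involved.
    """
    k = max(1, int(len(tokens) * keep_ratio))
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    return [tokens[i] for i in sorted(top)]  # restore original token order


# Token counts explode quadratically with resolution:
for side in (336, 2000, 8192):
    print(f"{side}x{side} -> {vit_token_count(side, side):,} tokens")
```

At a standard 336×336 input this yields 576 tokens, but an 8K-class image produces over 340,000, which is far beyond typical LLM context budgets and motivates aggressive pruning before tokens reach the language model.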