GeoLLaVA-8K: Scaling Remote-Sensing Multimodal Large Language Models to 8K Resolution

📅 2025-05-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the dual challenges of scarce training data and token explosion in ultra-high-resolution (UHR) remote sensing image understanding, this work introduces the first remote sensing multimodal large language model capable of processing 8K×8K inputs. Methodologically, we propose a novel background token pruning mechanism and anchor token selection strategy, integrated with object-centric attention, high-resolution fine-tuning, and remote sensing domain-specific instruction alignment—implemented atop the LLaVA framework for efficient token sparsification. We release SuperRS-VQA (mean resolution 8376×8376), the highest-resolution remote sensing vision-language dataset to date, alongside HighRS-VQA. Experiments demonstrate state-of-the-art performance on XLRS-Bench, substantial reduction in GPU memory consumption, and improved capabilities in fine-grained land-cover classification and long-context visual question answering.

📝 Abstract
Ultra-high-resolution (UHR) remote sensing (RS) imagery offers valuable data for Earth observation but poses challenges for existing multimodal foundation models due to two key bottlenecks: (1) limited availability of UHR training data, and (2) token explosion caused by the large image size. To address data scarcity, we introduce SuperRS-VQA (avg. 8,376×8,376) and HighRS-VQA (avg. 2,000×1,912), the highest-resolution vision-language datasets in RS to date, covering 22 real-world dialogue tasks. To mitigate token explosion, our pilot studies reveal significant redundancy in RS images: crucial information is concentrated in a small subset of object-centric tokens, while pruning background tokens (e.g., ocean or forest) can even improve performance. Motivated by these findings, we propose two strategies, Background Token Pruning and Anchored Token Selection, to reduce the memory footprint while preserving key semantics. Integrating these techniques, we introduce GeoLLaVA-8K, the first RS-focused multimodal large language model capable of handling inputs up to 8K×8K resolution, built on the LLaVA framework. Trained on SuperRS-VQA and HighRS-VQA, GeoLLaVA-8K sets a new state-of-the-art on the XLRS-Bench.
Problem

Research questions and friction points this paper is trying to address.

Addressing data scarcity in ultra-high-resolution remote sensing imagery
Mitigating token explosion from large remote sensing images
Enhancing multimodal models for 8K resolution Earth observation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces SuperRS-VQA and HighRS-VQA datasets
Uses Background Token Pruning strategy
Implements Anchored Token Selection technique
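The background-token-pruning idea above can be illustrated with a minimal sketch: score each visual token by some saliency measure and keep only a small object-centric subset. This is a hypothetical illustration, not the paper's actual implementation; the function name `prune_background_tokens`, the feature-norm scoring rule, and the `keep_ratio` parameter are all assumptions for the sake of the example.

```python
# Illustrative sketch (not the paper's method): prune visual tokens whose
# saliency is low, on the assumption that homogeneous background regions
# (e.g., ocean or forest) produce low-activation tokens.
import math

def prune_background_tokens(tokens, keep_ratio=0.25):
    """tokens: list of feature vectors (lists of floats).
    Returns indices of the kept tokens, in original order."""
    # Saliency proxy: L2 norm of each token's feature vector.
    scores = [math.sqrt(sum(x * x for x in t)) for t in tokens]
    k = max(1, int(len(tokens) * keep_ratio))
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)

# Example: four low-activation "background" tokens, two "object" tokens.
toks = [[0.1, 0.1]] * 4 + [[2.0, 1.5], [1.8, 2.2]]
print(prune_background_tokens(toks, keep_ratio=0.34))  # → [4, 5]
```

In a real pipeline the saliency score would come from the vision encoder (e.g., attention weights) rather than a raw feature norm, and the kept tokens would then be passed to the language model, shrinking the visual context and the memory footprint.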
Authors
Fengxiang Wang · National University of Defense Technology
Mingshuo Chen · Beijing University of Posts and Telecommunications, China
Yueying Li · College of Computer Science and Technology, National University of Defense Technology, China
Di Wang · School of Computer Science, Wuhan University, China
Haotian Wang · College of Computer Science and Technology, National University of Defense Technology, China
Zonghao Guo · University of Chinese Academy of Sciences
Zefan Wang · Tsinghua University
Boqi Shan · Beihang University, China
Long Lan · College of Computer Science and Technology, National University of Defense Technology, China
Yulin Wang · Shanghai Jiao Tong University
Hongzhen Wang · Tsinghua University, China
Wenjing Yang · College of Computer Science and Technology, National University of Defense Technology, China
Bo Du · Department of Management, Griffith Business School
Jing Zhang · School of Computer Science, Wuhan University, China