When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning

πŸ“… 2025-03-10
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
To address the trade-off between fine-grained detail preservation and prohibitive computational overhead when processing gigapixel-scale remote sensing images (RSIs) with large vision-language models (LVLMs), this paper proposes a text-guided, coarse-to-fine tile selection and dynamic visual token pruning framework. The method introduces: (1) a Dynamic Image Pyramid (DIP) for adaptive multi-scale representation; (2) a Region Focus Module (RFM) that localizes semantically critical image regions conditioned on the textual query; and (3) LRS-VQA, a new large-scale RSI visual question answering benchmark comprising 7,333 QA pairs across eight categories. Evaluated on four diverse RSI datasets, the approach outperforms existing high-resolution LVLM strategies, preserving fine-grained comprehension while achieving higher inference efficiency than state-of-the-art token compression methods. Both the source code and the LRS-VQA benchmark are publicly released.

πŸ“ Abstract
Efficient vision-language understanding of large Remote Sensing Images (RSIs) is meaningful but challenging. Current Large Vision-Language Models (LVLMs) typically employ limited pre-defined grids to process images, leading to information loss when handling gigapixel RSIs. Conversely, using unlimited grids significantly increases computational costs. To preserve image details while reducing computational complexity, we propose a text-guided token pruning method with Dynamic Image Pyramid (DIP) integration. Our method introduces: (i) a Region Focus Module (RFM) that leverages text-aware region localization capability to identify critical vision tokens, and (ii) a coarse-to-fine image tile selection and vision token pruning strategy based on DIP, which is guided by RFM outputs and avoids directly processing the entire large imagery. Additionally, existing benchmarks for evaluating LVLMs' perception ability on large RSIs suffer from limited question diversity and constrained image sizes. We construct a new benchmark named LRS-VQA, which contains 7,333 QA pairs across 8 categories, with image lengths of up to 27,328 pixels. Our method outperforms existing high-resolution strategies on four datasets using the same data. Moreover, compared to existing token reduction methods, our approach demonstrates higher efficiency under high-resolution settings. Dataset and code are available at https://github.com/VisionXLab/LRS-VQA.
Problem

Research questions and friction points this paper is trying to address.

Efficient vision-language understanding of gigapixel Remote Sensing Images (RSIs)
Reducing computational costs while preserving image details
Addressing limited question diversity in LVLM benchmarks for RSIs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Text-guided token pruning with DIP integration
Region Focus Module for critical token identification
Coarse-to-fine image tile selection strategy
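The coarse-to-fine idea above can be illustrated with a minimal sketch: build a multi-scale pyramid of the input image and, at each level, keep only the tiles that score highest against a text-derived query vector. This is not the authors' implementation; the pyramid construction, the toy tile-scoring function standing in for the RFM, and all names (`build_pyramid`, `tile_scores`, `coarse_to_fine_select`) are hypothetical.

```python
import numpy as np

def build_pyramid(image: np.ndarray, levels: int) -> list:
    """Downsample by 2x per level; return pyramid coarsest-first (toy DIP)."""
    pyramid = [image]
    for _ in range(levels - 1):
        h, w = pyramid[-1].shape[:2]
        pyramid.append(pyramid[-1][: h // 2 * 2 : 2, : w // 2 * 2 : 2])
    return pyramid[::-1]  # coarsest level first

def tile_scores(level: np.ndarray, tile: int, query_vec: np.ndarray) -> np.ndarray:
    """Stand-in for the Region Focus Module: score each tile by a simple
    dot product between its mean channel feature and the text query vector."""
    rows, cols = level.shape[0] // tile, level.shape[1] // tile
    scores = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            patch = level[r * tile:(r + 1) * tile, c * tile:(c + 1) * tile]
            feat = patch.mean(axis=(0, 1))     # toy "visual token" feature
            scores[r, c] = float(feat @ query_vec)
    return scores

def coarse_to_fine_select(image, levels=3, tile=4, keep=2, query_vec=None):
    """Select the top-`keep` highest-scoring tiles at each pyramid level,
    so fine-grained tokens are only retained where the text query points."""
    selected = []
    for level in build_pyramid(image, levels):
        s = tile_scores(level, tile, query_vec)
        top = np.argsort(s, axis=None)[::-1][:keep]
        selected.append([np.unravel_index(i, s.shape) for i in top])
    return selected
```

In the real method, tiles rejected at a coarse level would prune the corresponding vision tokens before the LVLM ever processes the fine levels; the sketch only shows the per-level selection step.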
πŸ”Ž Similar Papers
No similar papers found.
Junwei Luo
Wuhan University
Vision-Language Model · Oriented Object Detection · Remote Sensing
Yingying Zhang
Ant Group
Xue Yang
Shanghai Jiao Tong University
Kang Wu
Wuhan University
Qi Zhu
University of Science and Technology of China
Jingdong Chen
Ant Group
Yansheng Li
Professor, Wuhan University
Deep Learning · Knowledge Graph · Remote Sensing Big Data Mining