A Benchmark for Ultra-High-Resolution Remote Sensing MLLMs

📅 2025-12-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing remote sensing multimodal large language model (MLLM) benchmarks predominantly rely on low-resolution imagery or suffer from flawed reasoning-task designs, allowing purely textual LLMs to score highly without any visual input and thereby undermining the goal of evaluating genuine visual understanding. To address this, we propose RSHR-Bench, the first multimodal evaluation benchmark designed specifically for ultra-high-resolution (UHR) remote sensing imagery. It comprises 5,329 full-scene images with long-side dimensions of at least 4,000 pixels and supports four task types: multiple-choice and open-ended visual question answering (VQA), image captioning, and single-image assessment. To suppress linguistic priors, the benchmark applies adversarial filtering with strong LLMs followed by rigorous human verification, and it supports multi-turn and multi-image interactive evaluation. Experiments reveal substantial performance degradation of mainstream vision-language models (VLMs) on UHR inputs. The benchmark is publicly released, comprising 3,864 VQA items, 3,913 captions, and 500 human-verified samples.

📝 Abstract
Multimodal large language models (MLLMs) demonstrate strong perception and reasoning performance on existing remote sensing (RS) benchmarks. However, most prior benchmarks rely on low-resolution imagery, and some high-resolution benchmarks suffer from flawed reasoning-task designs. We show that text-only LLMs can perform competitively with multimodal vision-language models on RS reasoning tasks without access to images, revealing a critical mismatch between current benchmarks and the intended evaluation of visual understanding. To enable faithful assessment, we introduce RSHR-Bench, a super-high-resolution benchmark for RS visual understanding and reasoning. RSHR-Bench contains 5,329 full-scene images with a long side of at least 4,000 pixels, with up to about 3 x 10^8 pixels per image, sourced from widely used RS corpora and UAV collections. We design four task families: multiple-choice VQA, open-ended VQA, image captioning, and single-image evaluation. These tasks cover nine perception categories and four reasoning types, supporting multi-turn and multi-image dialog. To reduce reliance on language priors, we apply adversarial filtering with strong LLMs followed by rigorous human verification. Overall, we construct 3,864 VQA tasks, 3,913 image captioning tasks, and 500 fully human-written or verified single-image evaluation VQA pairs. Evaluations across open-source, closed-source, and RS-specific VLMs reveal persistent performance gaps in super-high-resolution scenarios. Code: https://github.com/Yunkaidang/RSHR
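To convey the scale the abstract describes, here is a minimal sketch of how many fixed-size tiles a typical VLM pipeline would need to cover one UHR image. The sliding-window tile size and overlap are illustrative assumptions; the paper does not prescribe a tiling scheme.

```python
import math

def tile_grid(width: int, height: int, tile: int = 1024, overlap: int = 128):
    """Count the overlapping tiles needed to cover a width x height image.

    `tile` and `overlap` are hypothetical values chosen for illustration;
    RSHR-Bench itself does not mandate a tiling strategy.
    """
    stride = tile - overlap
    nx = max(1, math.ceil((width - overlap) / stride))
    ny = max(1, math.ceil((height - overlap) / stride))
    return nx, ny

# A 20,000 x 15,000 image (3 x 10^8 pixels, the benchmark's upper bound)
nx, ny = tile_grid(20_000, 15_000)
print(nx * ny)  # → 391 tiles for a single image
```

Even under this simple scheme, one full-scene image yields hundreds of tiles, which helps explain why mainstream VLMs degrade on UHR inputs.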
Problem

Research questions and friction points this paper is trying to address.

Develops benchmark for ultra-high-resolution remote sensing visual reasoning
Addresses flawed reasoning-task designs in existing remote sensing benchmarks
Reduces reliance on language priors via adversarial filtering and human verification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces super-high-resolution benchmark RSHR-Bench
Applies adversarial filtering with LLMs for language bias reduction
Designs four task families covering perception and reasoning
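The adversarial-filtering idea can be sketched as follows: any VQA item that a text-only LLM answers correctly without seeing the image is discarded, and the survivors proceed to human verification. `ask_llm` is a hypothetical stand-in for a call to a strong text-only LLM; the paper's exact prompting and retry protocol may differ.

```python
def adversarial_filter(items, ask_llm, n_trials: int = 3):
    """Keep only VQA items a blind (text-only) LLM cannot answer.

    `ask_llm(question, choices)` is a hypothetical stand-in for querying
    a strong LLM with no image access. An item is discarded if the blind
    model answers it correctly in any of `n_trials` attempts, since such
    items are solvable from language priors alone.
    """
    kept = []
    for item in items:
        blind_correct = any(
            ask_llm(item["question"], item["choices"]) == item["answer"]
            for _ in range(n_trials)
        )
        if not blind_correct:
            kept.append(item)  # survives filtering; goes to human review
    return kept
```

For example, with a blind model that always guesses "A", an item whose answer is "A" would be filtered out while an item answered "B" would be kept.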
Yunkai Dang
School of Artificial Intelligence Science and Technology, Nanjing University
Meiyi Zhu
School of Artificial Intelligence Science and Technology, Nanjing University
Donghao Wang
School of Artificial Intelligence Science and Technology, Nanjing University
Yizhuo Zhang
School of Artificial Intelligence Science and Technology, Nanjing University
Jiacheng Yang
Nanjing University
🧠 Large Multimodal Models · 💪 Reinforcement Learning · 🥽 Visual Reasoning
Qi Fan
School of Artificial Intelligence Science and Technology, Nanjing University
Yuekun Yang
School of Artificial Intelligence Science and Technology, Nanjing University
Wenbin Li
School of Artificial Intelligence Science and Technology, Nanjing University
Feng Miao
Professor of Physics, Nanjing University
Mesoscopic physics · Nanoelectronics · 2D materials
Yang Gao
School of Artificial Intelligence Science and Technology, Nanjing University