🤖 AI Summary
Existing large vision-language models struggle to accurately perceive and describe low-level physical degradations in remote sensing images, because their visual priors are biased toward ground-level natural images. To bridge this gap, the authors introduce SenseBench, the first benchmark for low-level visual diagnosis in remote sensing. It is grounded in a physics-driven hierarchical taxonomy covering 6 major categories and 22 fine-grained degradation types, with over 10,000 meticulously annotated samples. The benchmark uses a dual-task evaluation protocol that assesses both perception and description capabilities. A systematic evaluation of 29 state-of-the-art vision-language models reveals critical limitations: skewed domain priors, collapse when multiple degradations occur together, fluent but hallucinated descriptions, and a perception-description inversion effect. SenseBench thus provides a high-quality dataset and a reliable evaluation platform to advance research on remote sensing image quality understanding.
📝 Abstract
Low-level visual perception underpins reliable remote sensing (RS) image analysis, yet current image quality assessment (IQA) methods output uninterpretable scalar scores rather than characterizing physics-driven RS degradations, deviating markedly from the diagnostic needs of RS experts. While Vision-Language Models (VLMs) present a compelling alternative by delivering language-grounded IQA, their visual priors are heavily biased toward ground-level natural images. Consequently, whether VLMs can overcome this domain gap to perceive and articulate RS artifacts remains insufficiently studied. To bridge this gap, we propose SenseBench, the first dedicated diagnostic benchmark for RS low-level visual perception and description. Driven by a physics-based hierarchical taxonomy that unifies both non-reference and reference-based paradigms, SenseBench features over 10K meticulously curated instances across 6 major and 22 fine-grained RS degradation categories. Specifically, two complementary protocols are designed for evaluation: objective low-level visual perception and subjective diagnostic description. Comprehensive evaluation of 29 state-of-the-art VLMs reveals not only skewed domain priors and multi-distortion collapse, but also fluency illusion and a perception-description inversion effect. We hope SenseBench provides a robust evaluation testbed and high-quality diagnostic data to advance the development of VLMs in RS low-level perception. Code and datasets are available at https://github.com/Zhong-Chenchen/SenseBench.