HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing

📅 2026-04-10

🤖 AI Summary
Existing multimodal large language models perform poorly on hyperspectral remote sensing image understanding tasks, and no dedicated benchmark exists to evaluate them. To address this gap, this work introduces HM-Bench, the first multimodal benchmark specifically designed for hyperspectral imagery, comprising 13 task categories and 19,337 question-answer pairs. The authors propose a dual-modality evaluation framework that transforms hyperspectral data into PCA-synthesized images and structured textual reports to enable systematic model assessment. Experiments across 18 state-of-the-art multimodal large language models reveal their constrained capabilities in complex spectral-spatial reasoning tasks and show that visual inputs generally yield better performance than text-only inputs, underscoring the critical role of visual representation in hyperspectral understanding.

📝 Abstract
While multimodal large language models (MLLMs) have made significant strides in natural image understanding, their ability to perceive and reason over hyperspectral images (HSI), a vital modality in remote sensing, remains underexplored. The high dimensionality and intricate spectral-spatial properties of HSI pose unique challenges for models primarily trained on RGB data. To address this gap, we introduce the Hyperspectral Multimodal Benchmark (HM-Bench), the first benchmark designed specifically to evaluate MLLMs in HSI understanding. We curate a large-scale dataset of 19,337 question-answer pairs across 13 task categories, ranging from basic perception to spectral reasoning. Given that existing MLLMs are not equipped to process raw hyperspectral cubes natively, we propose a dual-modality evaluation framework that transforms HSI data into two complementary representations: PCA-based composite images and structured textual reports. This approach facilitates a systematic comparison of how different representations affect model performance. Extensive evaluations on 18 representative MLLMs reveal significant difficulties in handling complex spatial-spectral reasoning tasks. Furthermore, our results demonstrate that visual inputs generally outperform textual inputs, highlighting the importance of grounding in spectral-spatial evidence for effective HSI understanding. The dataset and appendix can be accessed at https://github.com/HuoRiLi-Yu/HM-Bench.
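
To make the dual-modality idea concrete, the following is a minimal sketch of how a hyperspectral cube could be turned into a PCA-based composite image and a plain-text report. It is an illustration under stated assumptions, not the authors' pipeline: the function names `pca_composite` and `text_report`, the choice of three components, the per-component min-max scaling, and the report layout (mean value per band) are all hypothetical and are not taken from the paper.

```python
import numpy as np


def pca_composite(cube: np.ndarray, n_components: int = 3) -> np.ndarray:
    """Project an (H, W, B) cube onto its top principal components.

    Returns an (H, W, n_components) uint8 array usable as a false-color image.
    The scaling scheme is an assumption made for this sketch.
    """
    h, w, b = cube.shape
    pixels = cube.reshape(-1, b).astype(np.float64)
    pixels -= pixels.mean(axis=0)  # center each band

    # Eigendecomposition of the B x B band covariance matrix.
    cov = pixels.T @ pixels / (pixels.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:n_components]]

    scores = pixels @ top  # per-pixel component scores

    # Min-max scale each component independently to the range 0..255.
    lo = scores.min(axis=0)
    span = scores.max(axis=0) - lo
    span[span == 0] = 1.0  # avoid division by zero for constant components
    scaled = (scores - lo) / span
    return (scaled * 255).round().astype(np.uint8).reshape(h, w, n_components)


def text_report(cube: np.ndarray, wavelengths_nm: np.ndarray) -> str:
    """Summarize a cube as text: image size plus the mean value of each band.

    The report format is hypothetical; the paper's reports may differ.
    """
    h, w, b = cube.shape
    mean_spectrum = cube.reshape(-1, b).mean(axis=0)
    lines = [f"size: {h}x{w} pixels, {b} bands"]
    for wl, value in zip(wavelengths_nm, mean_spectrum):
        lines.append(f"{wl:.0f} nm: mean={value:.4f}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Synthetic cube standing in for real hyperspectral data.
    demo_cube = np.random.default_rng(0).random((8, 6, 20))
    image = pca_composite(demo_cube)
    report = text_report(demo_cube, np.linspace(400, 1000, 20))
    print(image.shape, image.dtype)
    print(report.splitlines()[0])
```

The image would be passed to a model as a visual input and the report as a text-only input, which is the kind of comparison the abstract describes. The sketch uses the band covariance matrix rather than an SVD of the full pixel matrix because the number of bands is usually far smaller than the number of pixels.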
Problem

Research questions and friction points this paper is trying to address.

multimodal large language models
hyperspectral image
remote sensing
spectral-spatial reasoning
high dimensionality
Innovation

Methods, ideas, or system contributions that make the work stand out.

hyperspectral remote sensing
multimodal large language models
benchmark
dual-modality evaluation
spectral-spatial reasoning
Xinyu Zhang
Sun Yat-sen University
Zurong Mai
Sun Yat-sen University
Qingmei Li
Tsinghua University
Remote Sensing, Spatial Analysis
Zjin Liao
Sun Yat-sen University
Yibin Wen
Sun Yat-sen University
Yuhang Chen
Sun Yat-sen University
Xiaoya Fan
Southwest University
Chan Tsz Ho
Sun Yat-sen University
Bi Tianyuan
Sun Yat-sen University
Haoyuan Liang
Sun Yat-sen University
Ruifeng Su
Sun Yat-sen University
Zihao Qian
Sun Yat-sen University
Juepeng Zheng
Tsinghua Shenzhen International Graduate School, National Supercomputing Center in Shenzhen
Jianxi Huang
Professor at China Agricultural University
Data assimilation, Climate change, Agricultural remote sensing, Crop modeling with remote sensing data assimilation, Crop yield
Yutong Lu
Sun Yat-sen University, National Supercomputing Center in Shenzhen
Haohuan Fu
Tsinghua University