HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing

📅 2026-04-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing multimodal large language models (MLLMs) perform poorly on hyperspectral remote sensing image understanding, and no dedicated benchmark exists to evaluate them on it. To address this gap, this work introduces HM-Bench, the first multimodal benchmark designed specifically for hyperspectral imagery, comprising 13 task categories and 19,337 question-answer pairs. Because current MLLMs cannot ingest raw hyperspectral cubes, the authors propose a dual-modality evaluation framework that converts hyperspectral data into PCA-synthesized images and structured textual reports, enabling systematic model assessment. Experiments across 18 state-of-the-art MLLMs reveal constrained capabilities in complex spectral-spatial reasoning tasks and show that visual inputs generally yield better performance than text-only inputs, underscoring the role of visual representation in hyperspectral understanding.

📝 Abstract

While multimodal large language models (MLLMs) have made significant strides in natural image understanding, their ability to perceive and reason over hyperspectral images (HSI), a vital modality in remote sensing, remains underexplored. The high dimensionality and intricate spectral-spatial properties of HSI pose unique challenges for models primarily trained on RGB data. To address this gap, we introduce the Hyperspectral Multimodal Benchmark (HM-Bench), the first benchmark designed specifically to evaluate MLLMs in HSI understanding. We curate a large-scale dataset of 19,337 question-answer pairs across 13 task categories, ranging from basic perception to spectral reasoning. Given that existing MLLMs are not equipped to process raw hyperspectral cubes natively, we propose a dual-modality evaluation framework that transforms HSI data into two complementary representations: PCA-based composite images and structured textual reports. This approach facilitates a systematic comparison of how different representations affect model performance. Extensive evaluations on 18 representative MLLMs reveal significant difficulties in handling complex spatial-spectral reasoning tasks. Furthermore, our results demonstrate that visual inputs generally outperform textual inputs, highlighting the importance of grounding in spectral-spatial evidence for effective HSI understanding. The dataset and appendix can be accessed at https://github.com/HuoRiLi-Yu/HM-Bench.
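
To make the dual-modality idea concrete, the following is a minimal sketch of how a hyperspectral cube might be turned into a three-channel PCA composite and a short textual report. It is not the authors' pipeline: the function names `pca_composite` and `text_report`, the min-max scaling, and the report fields are all assumptions chosen for illustration, and the abstract does not specify how HM-Bench builds either representation.

```python
import numpy as np


def pca_composite(cube: np.ndarray, n_components: int = 3) -> np.ndarray:
    """Project an (H, W, B) cube onto its top principal components.

    Hypothetical helper: returns an (H, W, n_components) uint8 image.
    """
    h, w, b = cube.shape
    pixels = cube.reshape(-1, b).astype(np.float64)  # astype makes a copy
    pixels -= pixels.mean(axis=0)                    # center each band
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(pixels, full_matrices=False)
    scores = pixels @ vt[:n_components].T
    # Min-max scale each component to 0..255 (an assumed display choice).
    lo = scores.min(axis=0)
    span = scores.max(axis=0) - lo
    span[span == 0] = 1.0                            # avoid division by zero
    scaled = (scores - lo) / span * 255.0
    return scaled.reshape(h, w, n_components).astype(np.uint8)


def text_report(cube: np.ndarray, top_k: int = 3) -> str:
    """Summarize a cube as plain text (assumed, illustrative report format)."""
    h, w, b = cube.shape
    band_means = cube.reshape(-1, b).mean(axis=0)
    brightest = np.argsort(band_means)[::-1][:top_k]
    lines = [
        f"Cube size: {h} x {w} pixels, {b} bands",
        f"Overall mean reflectance: {band_means.mean():.4f}",
    ]
    for idx in brightest:
        lines.append(f"Band {idx}: mean {band_means[idx]:.4f}")
    return "\n".join(lines)


# Synthetic stand-in for a real scene: 8 x 8 pixels, 20 spectral bands.
rng = np.random.default_rng(0)
demo_cube = rng.random((8, 8, 20))
demo_image = pca_composite(demo_cube)
demo_report = text_report(demo_cube)
```

Here `demo_image` could be saved as an RGB picture for the visual branch, while `demo_report` would be passed as the prompt context for the text-only branch. Keeping both derived from the same cube is what lets the two input modalities be compared on identical questions.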

Problem

Research questions and friction points this paper is trying to address.

multimodal large language models

hyperspectral image

remote sensing

spectral-spatial reasoning

high dimensionality

Innovation

Methods, ideas, or system contributions that make the work stand out.

hyperspectral remote sensing

multimodal large language models

benchmark