🤖 AI Summary
This study addresses a limitation of existing multimodal large language models (MLLMs) for remote sensing: because they make little use of 3D structural information, they struggle to distinguish surface objects that share similar textures but differ in height. The authors introduce VertiCue-Bench, the first diagnostic benchmark for reasoning grounded in canopy height models (CHMs), comprising 1,534 samples across 17 tasks. Using counterfactual modality ablation and a multitask evaluation framework, the benchmark separates low-level height perception from high-level semantic reasoning. Evaluations of 14 state-of-the-art MLLMs show that, although the models display rudimentary CHM awareness, they often underperform RGB-only baselines on semantic tasks that require joint geometric and appearance constraints, exposing a critical bottleneck in translating geometry into semantics.
📝 Abstract
Multimodal Large Language Models (MLLMs) have recently shown promising progress in geospatial reasoning. However, existing remote sensing benchmarks remain largely 2D-centric, evaluating models primarily on optical appearance. In natural environments, this paradigm breaks down due to severe spectral confusion, where ecologically distinct regions share similar textures but differ fundamentally in vertical structure. In such cases, explicit 3D structural data, such as Canopy Height Models (CHMs), become essential geometric evidence for semantic disambiguation. Yet, it remains unclear whether current MLLMs can genuinely leverage vertical cues to resolve appearance-level ambiguity. To address this gap, we introduce VertiCue-Bench, the first diagnostic benchmark for CHM-grounded geospatial reasoning. VertiCue-Bench comprises 1,534 carefully curated instances across 17 tasks, explicitly disentangling low-level height perception from ambiguity-aware semantic reasoning. Evaluations on 14 state-of-the-art general and remote-sensing-specialized MLLMs, combined with counterfactual modality testing, reveal a striking perception-reasoning dissociation. While models exhibit emerging competence in reading raw CHM height cues, they largely fail to translate geometric perception into reliable semantic reasoning, often underperforming RGB-only baselines when joint constraints are required. Overall, VertiCue-Bench exposes a critical geometry-to-semantics gap in natural scene understanding, offering actionable insights for advancing geospatial MLLMs.
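The counterfactual modality testing mentioned above can be pictured as scoring the same questions under two input conditions and comparing per-task accuracy. The following is a minimal sketch of that comparison, not the authors' evaluation code: the record fields (`task`, `condition`, `prediction`, `answer`), the condition labels `"rgb"` and `"rgb+chm"`, the task names, and the toy records are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def accuracy_by_task(records, condition):
    """Return {task: accuracy} over the records of one input condition."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        if r["condition"] != condition:
            continue
        totals[r["task"]] += 1
        hits[r["task"]] += int(r["prediction"] == r["answer"])
    return {task: hits[task] / totals[task] for task in totals}


def modality_gain(records, base="rgb", full="rgb+chm"):
    """Per-task accuracy change when the CHM input is added to RGB.

    A negative value means the model did worse with the CHM than
    with RGB alone on that task.
    """
    base_acc = accuracy_by_task(records, base)
    full_acc = accuracy_by_task(records, full)
    return {t: full_acc[t] - base_acc[t] for t in base_acc if t in full_acc}


# Invented toy records (NOT results from the paper): two questions per task,
# each answered once per input condition.
records = [
    {"task": "height_reading", "condition": "rgb", "prediction": "A", "answer": "A"},
    {"task": "height_reading", "condition": "rgb", "prediction": "B", "answer": "C"},
    {"task": "height_reading", "condition": "rgb+chm", "prediction": "A", "answer": "A"},
    {"task": "height_reading", "condition": "rgb+chm", "prediction": "C", "answer": "C"},
    {"task": "semantic", "condition": "rgb", "prediction": "A", "answer": "A"},
    {"task": "semantic", "condition": "rgb", "prediction": "D", "answer": "D"},
    {"task": "semantic", "condition": "rgb+chm", "prediction": "A", "answer": "A"},
    {"task": "semantic", "condition": "rgb+chm", "prediction": "B", "answer": "D"},
]

gains = modality_gain(records)
for task, delta in sorted(gains.items()):
    print(f"{task}: {delta:+.2f}")
```

In this toy setup, a positive gain on a perception-style task alongside a negative gain on a semantic task would mirror the kind of perception-reasoning dissociation the abstract describes; the actual benchmark may define its conditions and metrics differently.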