🤖 AI Summary
This work addresses the heavy reliance on expert knowledge in remote sensing–based lithological interpretation and the absence of systematic benchmarks for evaluating geological semantic understanding in large models. To bridge this gap, we introduce LithoBench, the first benchmark for remote sensing lithological interpretation that integrates multi-level geological semantics and expert assessment. LithoBench comprises 10,000 expert-annotated samples across 12 lithological classes, organized into 4,000 multiple-choice and 6,000 open-ended questions structured along five levels of cognitive complexity. Data validity and evaluation reliability are supported by an expert-in-the-loop, semi-automated construction pipeline that incorporates structured geological image descriptions. Experimental results show that current large models have substantial deficiencies in higher-order geological reasoning tasks, which supports the need for LithoBench.
📝 Abstract
Remote sensing lithology interpretation is fundamental to geological surveys, mineral exploration, and regional geological mapping. Unlike general land-cover recognition, lithology interpretation is a knowledge-intensive task that requires experts to infer rock types from subtle visual, spectral, textural, geomorphological, and contextual cues, which makes reliable automated interpretation highly challenging. Geological knowledge-guided large multimodal models offer new opportunities, yet their evaluation remains constrained by the lack of benchmarks that capture lithological annotations, multi-level geological semantics, and expert-informed assessment. Here, we propose LithoBench, a multi-level benchmark for evaluating geological semantic understanding in remote sensing lithology interpretation. LithoBench contains 10,000 expert-annotated interpretation instances across 12 representative lithological categories, including 4,000 multiple-choice and 6,000 open-ended tasks organized into five cognitive levels: Identification and Description, Comparative Analysis, Mechanism Explanation, Practical Application, and Comprehensive Reasoning. We further develop an expert-in-the-loop, knowledge-grounded, semi-automated construction pipeline that couples multiple sub-processes, such as structured geological image description, to enhance geological validity and evaluation reliability. Experiments with multiple large vision-language models reveal substantial limitations in geological semantic understanding, particularly on higher-order explanation, application, and reasoning tasks.
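To make the five-level organization concrete, the following is a minimal sketch of how multiple-choice results might be broken down by cognitive level. It is not the authors' evaluation code: the `MCItem` record, its field names, and the `accuracy_by_level` helper are hypothetical, and only the five level names come from the abstract.

```python
from collections import defaultdict
from dataclasses import dataclass

# The five cognitive levels named in the abstract.
LEVELS = (
    "Identification and Description",
    "Comparative Analysis",
    "Mechanism Explanation",
    "Practical Application",
    "Comprehensive Reasoning",
)


@dataclass(frozen=True)
class MCItem:
    """Hypothetical multiple-choice item; the field names are assumptions."""
    item_id: str
    level: str   # one of LEVELS
    answer: str  # gold option label, e.g. "B"


def accuracy_by_level(items, predictions):
    """Return {level: accuracy} for every level that has at least one item.

    `predictions` maps item_id -> predicted option label.
    An item with no prediction is counted as incorrect.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        if item.level not in LEVELS:
            raise ValueError(f"unknown cognitive level: {item.level!r}")
        total[item.level] += 1
        if predictions.get(item.item_id) == item.answer:
            correct[item.level] += 1
    return {lvl: correct[lvl] / total[lvl] for lvl in LEVELS if total[lvl]}
```

Reporting accuracy per level rather than as a single aggregate is what lets a weakness on the higher-order levels show up separately from performance on identification tasks. Open-ended questions would need a different scoring scheme, which this sketch does not attempt.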