🤖 AI Summary
Existing Earth observation (EO) benchmarks emphasize semantic understanding but lack quantitative regression capabilities for biophysical variables, hindering qualitative–quantitative synergistic analysis by vision-language models (VLMs) in scientific domains such as forest ecology. To address this gap, we introduce REO-Instruct—the first unified benchmark for EO description and regression—covering four task categories: human activity recognition, land-cover classification, ecological patch counting, and aboveground biomass estimation. REO-Instruct integrates Sentinel-2 optical and ALOS-2 SAR imagery, employs human–AI collaborative generation of structured textual annotations, and establishes a joint evaluation protocol for cross-modal alignment and numerical reasoning. Experiments reveal severe performance limitations of mainstream VLMs on regression tasks. This work fills a critical void in remote sensing by providing the first integrated “description + scientific reasoning” benchmark, offering a standardized evaluation platform and technical guidance for advancing geospatial VLMs.
📝 Abstract
Recent progress in vision-language models (VLMs) has enabled remarkable perception and reasoning capabilities, yet their potential for scientific regression in Earth Observation (EO) remains largely unexplored. Existing EO datasets mainly emphasize semantic understanding tasks such as captioning or classification, lacking benchmarks that align multimodal perception with measurable biophysical variables. To fill this gap, we present REO-Instruct, the first unified benchmark designed for both descriptive and regression tasks in EO. REO-Instruct establishes a cognitively interpretable logic chain in a forest ecology scenario (human activity recognition, land-cover classification, ecological patch counting, and above-ground biomass (AGB) regression), bridging qualitative understanding and quantitative prediction. The dataset integrates co-registered Sentinel-2 and ALOS-2 imagery with structured textual annotations generated and validated through a hybrid human–AI pipeline. Comprehensive evaluation protocols and baseline results across generic VLMs reveal that current models struggle with numeric reasoning, highlighting an essential challenge for scientific VLMs. REO-Instruct offers a standardized foundation for developing and assessing next-generation geospatial models capable of both description and scientific inference. The project page is publicly available at [REO-Instruct](https://github.com/zhu-xlab/REO-Instruct).