UHR-Micro: Diagnosing and Mitigating the Resolution Illusion in Earth Observation VLMs

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the "resolution illusion" in Earth observation vision-language models (VLMs), where ultra-high-resolution inputs fail to improve perception of very small targets. The authors define and diagnose this issue and introduce UHR-Micro, a benchmark of 11,253 fine-grained instructions with spatial annotations. They also propose Micro-evidence Active Perception (MAP), which departs from conventional whole-image processing by decomposing queries and actively retrieving candidate regions, so that reasoning is local and centered on micro-evidence. Experiments show that state-of-the-art high-resolution VLMs perform poorly on micro-target tasks, whereas the MAP agent substantially improves spatial localization and evidence interpretation. The work provides a new evaluation benchmark and a promising direction for advancing Earth observation VLMs.

📝 Abstract

Vision-Language Models (VLMs) increasingly operate on ultra-high-resolution (UHR) Earth observation imagery, yet they remain vulnerable to a severe scale mismatch between large-scale scene context and micro-scale targets. We refer to this empirical gap as a "resolution illusion": higher input resolution provides the appearance of richer visual detail, but does not necessarily yield reliable perception of spatially small, task-relevant evidence. To measure this challenge, we introduce UHR-Micro, a benchmark comprising 11,253 instructions grounded in 1,212 UHR images, designed to evaluate VLMs at the spatial limits of native Earth observation imagery. UHR-Micro spans diverse micro-target scales, context requirements, task families, and visual conditions, and provides diagnostic annotations that support controlled evaluation and fine-grained error attribution. Experiments with representative high-resolution VLMs show substantial failures in spatial grounding and evidence parsing, despite access to high-resolution inputs. Further analysis suggests that these failures are not fully resolved by increasing model capacity, but are closely tied to insufficient guidance in locating and using task-relevant micro-evidence. Motivated by this finding, we propose Micro-evidence Active Perception (MAP), implemented as a reference agent (MAP-Agent) that decomposes queries into evidence-seeking steps, actively inspects candidate regions, and grounds its answers in localized observations. MAP-Agent improves micro-level perception by making high-resolution reasoning evidence-centered rather than image-centered. Together, UHR-Micro and MAP-Agent provide a diagnostic platform for evaluating, understanding, and advancing high-resolution reasoning in Earth observation VLMs. Datasets and source code are available at https://github.com/MiliLab/UHR-Micro.
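
The abstract describes the evidence-centered loop only at a high level, so the following is a minimal sketch of the general idea and not the authors' implementation. It assumes a toy grayscale grid in place of a UHR image, fixed-size tiling as the source of candidate regions, and a caller-supplied scoring function in place of a VLM. Every name here (`decompose`, `tile_boxes`, `inspect`, `map_answer`, `Evidence`) is hypothetical and does not come from the released code.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Image = List[List[int]]          # toy grayscale grid standing in for a UHR scene
Box = Tuple[int, int, int, int]  # (row0, col0, row1, col1), end-exclusive


@dataclass
class Evidence:
    box: Box      # where the observation was made
    score: float  # how strongly the region supports the query


def decompose(query: str) -> List[str]:
    """Hypothetical decomposition: split a query into evidence-seeking steps."""
    return [step.strip() for step in query.split(" then ") if step.strip()]


def tile_boxes(height: int, width: int, tile: int) -> List[Box]:
    """Assumed candidate generator: non-overlapping tiles covering the image."""
    return [
        (r, c, min(r + tile, height), min(c + tile, width))
        for r in range(0, height, tile)
        for c in range(0, width, tile)
    ]


def inspect(image: Image, box: Box, scorer: Callable[[Image], float]) -> Evidence:
    """Crop one candidate region at native resolution and score it locally."""
    r0, c0, r1, c1 = box
    crop = [row[c0:c1] for row in image[r0:r1]]
    return Evidence(box=box, score=scorer(crop))


def map_answer(
    image: Image,
    query: str,
    scorer: Callable[[Image], float],
    tile: int = 4,
    top_k: int = 1,
) -> Tuple[List[str], List[Evidence]]:
    """Return the query steps plus the top-k regions that ground the answer."""
    steps = decompose(query)
    boxes = tile_boxes(len(image), len(image[0]), tile)
    ranked = sorted(
        (inspect(image, box, scorer) for box in boxes),
        key=lambda ev: ev.score,
        reverse=True,
    )
    return steps, ranked[:top_k]


# Toy scene: an 8x8 background with a single bright "micro-target" pixel.
scene = [[0] * 8 for _ in range(8)]
scene[5][6] = 255


def brightest(crop: Image) -> float:
    """Stand-in scorer: peak intensity inside the crop."""
    return float(max(max(row) for row in crop))


steps, evidence = map_answer(scene, "locate the vehicle then count its axles", brightest)
```

In this toy setup the answer is tied to an explicit box rather than to the whole image, which is the sense in which the reasoning is evidence-centered. A real agent would replace the tiling and the scorer with learned components.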

Problem

Research questions and friction points this paper is trying to address.

resolution illusion

ultra-high-resolution

Earth observation

micro-scale targets

vision-language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

resolution illusion

ultra-high-resolution Earth observation

micro-target perception