UHR-Micro: Diagnosing and Mitigating the Resolution Illusion in Earth Observation VLMs

📅 2026-05-12
📈 Citations: 0
Influential: 0

🤖 AI Summary
This study addresses the "resolution illusion" in Earth observation vision-language models (VLMs), where ultra-high-resolution inputs do not reliably improve perception of very small targets. The authors define and diagnose the issue, introduce UHR-Micro, a benchmark of 11,253 fine-grained instructions with spatial annotations, and propose Micro-evidence Active Perception (MAP). Rather than processing the whole image at once, MAP decomposes queries and actively retrieves candidate regions, so that reasoning is local and centered on micro-evidence. Experiments show that state-of-the-art high-resolution VLMs perform poorly on micro-target tasks, whereas the MAP agent improves spatial localization and evidence interpretation. The work contributes a new evaluation benchmark and a reference approach for advancing Earth observation VLMs.
📝 Abstract
Vision-Language Models (VLMs) increasingly operate on ultra-high-resolution (UHR) Earth observation imagery, yet they remain vulnerable to a severe scale mismatch between large-scale scene context and micro-scale targets. We refer to this empirical gap as a "resolution illusion": higher input resolution provides the appearance of richer visual detail, but does not necessarily yield reliable perception of spatially small, task-relevant evidence. To benchmark this challenge, we introduce UHR-Micro, a benchmark comprising 11,253 instructions grounded in 1,212 UHR images, designed to evaluate VLMs at the spatial limits of native Earth observation imagery. UHR-Micro spans diverse micro-target scales, context requirements, task families, and visual conditions, and provides diagnostic annotations that support controlled evaluation and fine-grained error attribution. Experiments with representative high-resolution VLMs show substantial failures in spatial grounding and evidence parsing, despite access to high-resolution inputs. Further analysis suggests that these failures are not fully resolved by increasing model capacity, but are closely tied to insufficient guidance in locating and using task-relevant micro-evidence. Motivated by this finding, we propose Micro-evidence Active Perception (MAP), a reference agent that decomposes queries into evidence-seeking steps, actively inspects candidate regions, and grounds its answers in localized observations. MAP-Agent improves micro-level perception by making high-resolution reasoning evidence-centered rather than image-centered. Together, UHR-Micro and MAP-Agent provide a diagnostic platform for evaluating, understanding, and advancing high-resolution reasoning in Earth observation VLMs. Datasets and source code were released at https://github.com/MiliLab/UHR-Micro.
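The evidence-centered loop described in the abstract (decompose the query, inspect candidate regions, keep localized evidence) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `tile_boxes`, `gather_evidence`, and `score_region` are hypothetical, candidate regions are assumed to be fixed-size tiles, and the scoring function stands in for whatever model would judge a region's relevance to a sub-query.

```python
from typing import Callable, Dict, List, Tuple

# A region is (left, top, right, bottom) in pixel coordinates.
Box = Tuple[int, int, int, int]


def tile_boxes(width: int, height: int, tile: int, stride: int) -> List[Box]:
    """Enumerate candidate regions as a regular grid of tiles (an assumption).

    Tiles start at multiples of `stride`; pixels near the right and bottom
    edges may be left uncovered when the image size is not aligned to the grid.
    """
    boxes = []
    for top in range(0, max(height - tile, 0) + 1, stride):
        for left in range(0, max(width - tile, 0) + 1, stride):
            boxes.append((left, top, min(left + tile, width), min(top + tile, height)))
    return boxes


def gather_evidence(
    width: int,
    height: int,
    sub_queries: List[str],
    score_region: Callable[[str, Box], float],
    tile: int = 512,
    stride: int = 512,
    top_k: int = 3,
) -> Dict[str, List[Box]]:
    """For each sub-query, rank candidate tiles and keep the top_k as evidence.

    `score_region` is a hypothetical placeholder for a relevance model.
    """
    candidates = tile_boxes(width, height, tile, stride)
    evidence = {}
    for query in sub_queries:
        ranked = sorted(candidates, key=lambda box: score_region(query, box), reverse=True)
        evidence[query] = ranked[:top_k]
    return evidence


def toy_scorer(query: str, box: Box) -> float:
    """Toy relevance: 1.0 if the tile contains a known target point, else 0.0."""
    target_x, target_y = 700, 300
    left, top, right, bottom = box
    return 1.0 if left <= target_x < right and top <= target_y < bottom else 0.0


if __name__ == "__main__":
    result = gather_evidence(1024, 1024, ["find the vehicle"], toy_scorer, top_k=1)
    print(result)
```

In this sketch the answer would then be produced from the retained tiles only, which is one way to read "evidence-centered rather than image-centered"; the actual region proposal and reasoning steps used by MAP are not specified in this summary.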
Problem

Research questions and friction points this paper is trying to address.

resolution illusion
ultra-high-resolution
Earth observation
micro-scale targets
vision-language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

resolution illusion
ultra-high-resolution Earth observation
micro-target perception
Vision-Language Models
active perception
Authors
Shuo Ni
National Key Laboratory of Science and Technology on Space-Born Intelligent Information Processing, Beijing Institute of Technology, Beijing, China
Tong Wang
National Key Laboratory of Science and Technology on Space-Born Intelligent Information Processing, Beijing Institute of Technology, Beijing, China
Jing Zhang
School of Computer Science, Wuhan University, Wuhan, China
He Chen
Chinese University of Hong Kong
Mathematical Programming
Haonan Guo
LIESMARS, Wuhan University
Ning Zhang
Dongguan University of Technology
optimization
Bo Du
Department of Management, Griffith Business School
Sustainable Transport, Travel Behaviour, Urban Data Analytics, Logistics and Supply Chain