Asking like Socrates: Socrates helps VLMs understand remote sensing images

📅 2025-11-27
🤖 AI Summary
Remote sensing multimodal reasoning suffers from pervasive “pseudo reasoning”: models generate answers from linguistic self-consistency rather than visual evidence, primarily because a single coarse pass over large-scale remote sensing imagery (the “Glance Effect”) leaves their understanding incomplete. To address this, the paper proposes RS-EoT (Remote Sensing Evidence-of-Thought), a language-driven, iterative visual evidence-seeking paradigm, and introduces SocraticAgent, a self-play multi-agent system that synthesizes reasoning traces through alternating cycles of reasoning and visual inspection. A two-stage progressive reinforcement learning strategy then strengthens and generalizes these patterns: RL on fine-grained grounding tasks first, followed by RL on remote sensing VQA. RS-EoT achieves state-of-the-art performance on multiple remote sensing VQA and grounding benchmarks, and analyses show clear iterative cycles of reasoning and evidence seeking, indicating that it mitigates the Glance Effect.

📝 Abstract
Recent multimodal reasoning models, inspired by DeepSeek-R1, have significantly advanced vision-language systems. However, in remote sensing (RS) tasks, we observe widespread pseudo reasoning: models narrate the process of reasoning rather than genuinely reason toward the correct answer based on visual evidence. We attribute this to the Glance Effect, where a single, coarse perception of large-scale RS imagery results in incomplete understanding and reasoning based on linguistic self-consistency instead of visual evidence. To address this, we propose RS-EoT (Remote Sensing Evidence-of-Thought), a language-driven, iterative visual evidence-seeking paradigm. To instill this paradigm, we propose SocraticAgent, a self-play multi-agent system that synthesizes reasoning traces via alternating cycles of reasoning and visual inspection. To enhance and generalize these patterns, we propose a two-stage progressive RL strategy: first, RL on fine-grained Grounding tasks to enhance RS-EoT capabilities, followed by RL on RS VQA to generalize to broader understanding scenarios. Experiments show RS-EoT achieves state-of-the-art performance on multiple RS VQA and grounding benchmarks. Analyses reveal clear iterative cycles of reasoning and evidence seeking, confirming RS-EoT mitigates the Glance Effect and enables genuine evidence-grounded reasoning. Our code, data, and models are available at https://geox-lab.github.io/Asking_like_Socrates
Problem

Research questions and friction points this paper is trying to address.

Addresses pseudo reasoning in remote sensing vision-language models
Mitigates the Glance Effect for evidence-based image understanding
Enables iterative visual evidence-seeking in remote sensing tasks
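The alternating cycle of reasoning and visual inspection described in the abstract can be pictured with a toy loop. This is only a minimal sketch under stated assumptions, not the authors' SocraticAgent: the names `evidence_loop`, `ask`, `inspect`, `conclude`, and the `scene` dictionary are all hypothetical, and a lookup table stands in for a real vision-language model looking at image regions.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Trace:
    """Interleaved record of (question, evidence) pairs plus the final answer."""
    steps: list[tuple[str, str]] = field(default_factory=list)
    answer: Optional[str] = None


def evidence_loop(
    ask: Callable[[list[tuple[str, str]]], Optional[str]],
    inspect: Callable[[str], str],
    conclude: Callable[[list[tuple[str, str]]], str],
    max_rounds: int = 5,
) -> Trace:
    """Alternate between posing a question and inspecting the image for evidence."""
    trace = Trace()
    for _ in range(max_rounds):
        question = ask(trace.steps)
        if question is None:  # the reasoner judges the evidence sufficient
            break
        trace.steps.append((question, inspect(question)))
    trace.answer = conclude(trace.steps)
    return trace


# Hypothetical stand-in for an image: region name -> what a closer look reveals.
scene = {
    "north-west": "two aircraft on an apron",
    "south-east": "one aircraft near a hangar",
    "centre": "an empty runway",
}
NUMBER_WORDS = {"one": 1, "two": 2}


def ask(steps):
    """Toy reasoner: request the next region that has not been inspected yet."""
    seen = {question for question, _ in steps}
    return next((region for region in scene if region not in seen), None)


def inspect(region):
    """Toy perceiver: return the evidence found in the requested region."""
    return scene[region]


def conclude(steps):
    """Toy answer to 'how many aircraft?': count only from gathered evidence."""
    total = sum(
        NUMBER_WORDS.get(evidence.split()[0], 0)
        for _, evidence in steps
        if "aircraft" in evidence
    )
    return str(total)


trace = evidence_loop(ask, inspect, conclude)
print(trace.steps)
print(trace.answer)
```

The point of the sketch is that the final answer is computed only from evidence collected across several targeted looks, rather than from one glance at the whole scene.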
Innovation

Methods, ideas, or system contributions that make the work stand out.

Language-driven iterative visual evidence-seeking paradigm
Self-play multi-agent system synthesizing reasoning traces
Two-stage progressive reinforcement learning strategy
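The abstract says the first RL stage uses fine-grained grounding tasks but does not state the reward. As an illustration only, the sketch below assumes a thresholded intersection-over-union (IoU) reward on axis-aligned boxes, which is a common choice for grounding; the function names, the `(x1, y1, x2, y2)` box format, and the 0.5 threshold are assumptions, not details taken from the paper.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def grounding_reward(pred_box, target_box, threshold=0.5):
    """Assumed reward: 1.0 if the predicted box overlaps the target enough, else 0.0."""
    return 1.0 if iou(pred_box, target_box) >= threshold else 0.0


print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
print(grounding_reward((0, 0, 2, 2), (0, 0, 2, 2)))
```

A verifiable reward of this kind needs no learned judge, which is one reason grounding is a natural first stage before moving to open-ended VQA.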
Run Shao
School of Geosciences and Info-Physics, Central South University, Changsha, China
Ziyu Li
Philips I&D Data & AI
Knowledge Extraction, Query Optimization, Machine Learning, Graph
Zhaoyang Zhang
School of Geosciences and Info-Physics, Central South University, Changsha, China
Linrui Xu
School of Geosciences and Info-Physics, Central South University, Changsha, China
Xinran He
Baidu Inc., Beijing, China
Hongyuan Yuan
School of Geosciences and Info-Physics, Central South University, Changsha, China
Bolei He
Baidu Inc., Beijing, China
Yongxing Dai
Baidu Inc., Beijing, China
Yiming Yan
School of Earth Sciences, Zhejiang University, Hangzhou, China
Yijun Chen
School of Earth Sciences, Zhejiang University, Hangzhou, China
Wang Guo
School of Geosciences and Info-Physics, Central South University, Changsha, China
Haifeng Li
Central South University
GIS, Remote sensing, Machine learning, Sparse representation, Brain Theory